thr3ads.net - search: "vectorize

Displaying 20 results from an estimated 21 matches for "vectorize_width".

Vectorization width not correct using #pragma clang loop vectorize_width

2018 Sep 20

Hello, I'm trying to set the vector width using #pragma clang loop vectorize_width(32), but I'm getting width 8 for the following kernel: #define M 128 #define N 128 #define SQRT_FUN(x) sqrtf(x) int main(int argc, char** argv) { /* Variable declaration/allocation. */ double float_n = (double)N; double data[N*M]; double corr[M*M]; double mean[M]; double st...

vectorize.enable

2019 Oct 02

Hi Michael and Florian, (+ llvm-dev for visibility) I would like to quickly follow up on "Pragma vectorize_width() implies vectorize(enable)", which got reverted with commit 858a1ae for 2 reasons; see also that revert commit message. Ignore the assert, that's been fixed now. The other thing is that with the patch, behaviour is slightly changed and we could get a diagnostic we didn't get before:...

Invoke loop vectorizer

2016 Aug 12

...t? If so, should the vectorized code on IR level? On Aug 12, 2016 11:39 AM, "Daniel Berlin" <dberlin at dberlin.org> wrote: > cat > test.c > > #define SIZE 128 > > void bar(int *restrict A, int* restrict B,int K) { > > #pragma clang loop vectorize(enable) vectorize_width(2) unroll_count(8) > > for (int i = 0; i < SIZE; ++i) > > A[i] += B[i] + K; > > } > > [dannyb at dannyb-macbookpro3 11:37:20] ~ :) $ clang -O3 test.c -c > -save-temps > [dannyb at dannyb-macbookpro3 11:38:28] ~ :) $ pcregrep -i "^\s*p" > test.s|l...

Invoke loop vectorizer

2016 Aug 12

...t A and B don't alias in this example. It's > almost certainly not profitable to add a runtime check given the size of > the loop. > > > try > > #define SIZE 8 > > void bar(int *restrict A, int* restrict B,int K) { > > #pragma clang loop vectorize(enable) vectorize_width(2) unroll_count(8) > > for (int i = 0; i < SIZE; ++i) > > A[i] += B[i] + K; > > } > > (i don't remember if llvm also does runtime alias checks, but if it does, > you'd probably need to increase size to get it to vectorize) > > On Fri, Aug 12, 2016 a...

Invoke loop vectorizer

2016 Aug 12

.... I found that even when the loop vectorizer and SLP vectorizer are enabled, my simple test still does not get optimized. I also tried the clang pragma in my test to force vectorization. What do you think is the problem? Test: #define SIZE 8 void bar(int *A, int* B,int K) { #pragma clang loop vectorize(enable) vectorize_width(2) unroll_count(8) for (int i = 0; i < SIZE; ++i) A[i] += B[i] + K; } Thanks, Xiaochu On Aug 12, 2016 4:06 AM, "Andrey Bokhanko" <andreybokhanko at gmail.com> wrote: > Hi Xiaochu, > > Clang uses -O0 by default, which doesn't run any optimizations. Try >...
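
The test in this snippet can be written out as a self-contained file. This is a sketch, not the thread's exact code: the `restrict` qualifiers follow the variant posted later in the same thread, and the build line in the comment is an assumed invocation for a Clang toolchain (`-Rpass=loop-vectorize` asks Clang to report the loops it vectorized).

```c
/* Assumed build line: clang -O3 -Rpass=loop-vectorize -c test.c
   Per the reply quoted above, Clang defaults to -O0, where the
   loop vectorizer does not run, so an optimization level is needed. */
#define SIZE 8

void bar(int *restrict A, int *restrict B, int K)
{
    /* Loop hints: ask for vectorization with 2 lanes and an unroll
       count of 8. The optimizer may still decline to apply them. */
#pragma clang loop vectorize(enable) vectorize_width(2) unroll_count(8)
    for (int i = 0; i < SIZE; ++i)
        A[i] += B[i] + K;
}
```

The pragma only affects code generation; the function computes the same result whether or not the loop is vectorized.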

vectorize.enable

2019 Oct 02

...nd after my commit, the loop vectoriser was bailing because "Not vectorizing: The exiting block is not the loop latch". The source looks like a straightforward canonical loop. What pass transformed it to have code between the exiting block and the latch? > But the difference is that vectorize_width() now implies vectorize(enable), and so this is now marked as forced vectorisation which wasn't the case before. Because of this forced vectorization, and that the transformation wasn't applied, we now emit this diagnostic. The first part of this diagnostic is spot on: "the optimizer w...

Loop vectorization and unsafe floating point math

2020 Jun 24

...org/z/fzRHsp ) //------------------------------------------------------------------ // // clang -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize #include <stdio.h> #include <stdint.h> double v_1 = -902.30847021; double v_2 = -902.30847021; int main() { #pragma clang loop vectorize_width(2) unroll(disable) for (int i = 0; i < 16; ++i) { v_1 = v_1 * 430.33975544; } #pragma clang loop unroll(disable) for (int i = 0; i < 16; ++i) { v_2 = v_2 * 430.33975544; } printf("v_1: %f\n", v_1); printf("v_2: %f\n", v_2); } // //-----------------...

Question about __builtin_assume()

2015 Dec 22

void test_copy_vec(const short* restrict src, short* restrict res, int N) { __builtin_assume( (N > 1) && (N%2 == 0) ); #pragma clang loop vectorize(enable) vectorize_width(2) interleave_count(1) for (int j=0; j<N; ++j) *res++ = *src++; } If I use __builtin_assume(N>1) then llvm knows the loop will execute and not check for (j <= 0), but I can't seem to get it to accept N is even. Is there a way to get llvm to vectorize the loop and not generate th...

loop vectorizer disabling

2019 Sep 10

...o. We added a new pragma [1], and enabling this new transformation option implies setting the transformation [2]. This is something that our docs promise for other transformation options too, except that this wasn't happening and so we started fixing that. In [3] for example, we implement that `vectorize_width()` implies `vectorize(enable)`. Related to this, we started discussing in [4] what `vectorize(disable)` should mean easier of [3], because it makes implementation easier but more importantly because that would probably match user expectations better. [1] https://reviews.llvm.org/D64744 [2] https:...

[RFC][VECLIB] how should we legalize VECLIB calls?

2018 Jun 29

Illustrative Example: clang -fveclib=SVML -O3 svml.c -mavx #include <math.h> void foo(double *a, int N){ int i; #pragma clang loop vectorize_width(8) for (i=0;i<N;i++){ a[i] = sin(i); } } Currently, this results in a call to <8 x double> __svml_sin8(<8 x double>) after the vectorizer. This is 8-element SVML sin() called with 8-element argument. On the surface, this looks very good. Later on, standard vector type legali...

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 23

...Or if there was only one add/mul > inside the loop we'd have to reduce its width and the width of the phi. > > > Can you explain how the desired code from the vectorizer differs from the > code that the vectorizer produces if you add '#pragma clang loop > vectorize(enable) vectorize_width(16)' above the loop? I tried it in your > godbolt example and the generated code looks very similar to the > icc-generated code. > It's similar, but the vpxor %xmm0, %xmm0, %xmm0 is being unnecessarily carried across the loop. It's then redundantly added twice in the reductio...

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 23

...Or if there was only one add/mul inside the loop >> we'd have to reduce its width and the width of the phi. > > Can you explain how the desired code from the vectorizer differs from > the code that the vectorizer produces if you add '#pragma clang loop > vectorize(enable) vectorize_width(16)' above the loop? I tried it in > your godbolt example and the generated code looks very similar to the > icc-generated code. (specifically, I mean this: https://godbolt.org/g/LJA38e) > > Thanks again, > Hal > >> >> Thanks, >> ~Craig > > -- >...

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 23

Hello all, This code https://godbolt.org/g/tTyxpf is a dot product reduction loop multiplying sign-extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 24

...add/mul >> inside the loop we'd have to reduce its width and the width of the phi. >> >> >> Can you explain how the desired code from the vectorizer differs from the >> code that the vectorizer produces if you add '#pragma clang loop >> vectorize(enable) vectorize_width(16)' above the loop? I tried it in your >> godbolt example and the generated code looks very similar to the >> icc-generated code. >> > > It's similar, but the vpxor %xmm0, %xmm0, %xmm0 is being unnecessarily > carried across the loop. It's then redundantly a...

Invoke loop vectorizer

2016 Aug 11

Hi there, I use clang-cl /Qvec test.c to compile the code, but the LoopVectorizer pass is never invoked. I was wondering if this is sufficient to enable the auto-vectorizer? Thanks, Xiaochu

[RFC][VECLIB] how should we legalize VECLIB calls?

2018 Jun 29

...-dev' <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> Subject: [llvm-dev] [RFC][VECLIB] how should we legalize VECLIB calls? Illustrative Example: clang -fveclib=SVML -O3 svml.c -mavx #include <math.h> void foo(double *a, int N){ int i; #pragma clang loop vectorize_width(8) for (i=0;i<N;i++){ a[i] = sin(i); } } Currently, this results in a call to <8 x double> __svml_sin8(<8 x double>) after the vectorizer. This is 8-element SVML sin() called with 8-element argument. On the surface, this looks very good. Later on, standard vector type legali...

[RFC][VECLIB] how should we legalize VECLIB calls?

2018 Jul 02

...>> >> >> >> Illustrative Example: >> >> >> >> clang -fveclib=SVML -O3 svml.c -mavx >> >> >> >> #include <math.h> >> >> void foo(double *a, int N){ >> >> int i; >> >> #pragma clang loop vectorize_width(8) >> >> for (i=0;i<N;i++){ >> >> a[i] = sin(i); >> >> } >> >> } >> >> >> >> Currently, this results in a call to <8 x double> __svml_sin8(<8 x >> double>) after the vectorizer. >> >> Th...

[RFC][VECLIB] how should we legalize VECLIB calls?

2018 Jul 02

...llvm-dev] [RFC][VECLIB] how should we legalize VECLIB calls? > > > > > > Illustrative Example: > > > > clang -fveclib=SVML -O3 svml.c -mavx > > > > #include <math.h> > > void foo(double *a, int N){ > > int i; > > #pragma clang loop vectorize_width(8) > > for (i=0;i<N;i++){ > > a[i] = sin(i); > > } > > } > > > > Currently, this results in a call to <8 x double> __svml_sin8(<8 x > double>) after the vectorizer. > > This is 8-element SVML sin() called with 8-element argument. O...

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 24

...one add/mul inside the loop we'd have to reduce > its width and the width of the phi. > > > Can you explain how the desired code from the vectorizer differs > from the code that the vectorizer produces if you add '#pragma > clang loop vectorize(enable) vectorize_width(16)' above the loop? > I tried it in your godbolt example and the generated code looks > very similar to the icc-generated code. > > > > Vectorizer considers the largest data type size in the loop body and > considers the maximum possible VF as 8, hence in this e...

[RFC] Allow loop vectorizer to choose vector widths that generate illegal types

2016 Jun 15

Hello, Currently the loop vectorizer will, by default, not consider vectorization factors that would make it generate types that do not fit into the target platform's vector registers. That is, if the widest scalar type in the scalar loop is i64, and the platform's largest vector register is 256-bit wide, we will not consider a VF above 4. We have a command line option (-mllvm
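
The bound described in this snippet is plain arithmetic: the largest vectorization factor considered is the vector register width divided by the width of the widest scalar type in the loop. A minimal sketch in C, where `max_legal_vf` is a hypothetical helper named for illustration only and is not an LLVM function:

```c
#include <assert.h>

/* Hypothetical helper (not part of LLVM): the largest vectorization
   factor whose vector type still fits in one target vector register. */
static unsigned max_legal_vf(unsigned register_bits, unsigned widest_scalar_bits)
{
    /* The scalar type must be non-empty and fit in the register. */
    assert(widest_scalar_bits != 0 && widest_scalar_bits <= register_bits);
    return register_bits / widest_scalar_bits;
}
```

With the numbers from the snippet, `max_legal_vf(256, 64)` evaluates to 4, which matches the stated limit for i64 on a 256-bit register.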
