qiaopeixin via llvm-dev
2021-Sep-08 09:41 UTC
[llvm-dev] About discussion of vectorization pass and openmp `simd` and `ordered simd` directives
Hi, I would like to discuss the behaviors of openmp `simd` and `ordered simd` directives. I think current Clang may not give expected results as OpenMP 5.0 standard defines. Let's start one c++ example: ``` void func(float *a, float *b, float *c, float *d, int N) { #pragma omp simd for (int i = 0; i < N; i++) { d[i] = c[i] + 1.0; #pragma omp ordered simd a[i] = b[i] + 1.0; } } ``` What is expected according to OpenMP 5.0 standard is like the following: ``` void func(float *a, float *b, float *c, float *d, int N) { for (int i = 0; i < N; i += 4) { #pragma omp simd for (int j = i; j < 4; j++) d[i] = c[i] + 1.0; // vectorized for (int j = i; j < 4; j++) a[i] = b[i] + 1.0; // not vectorized } } ``` It seems that current Clang and LLVM do not support it. Without openmp enabled, clang vectorizes the loop with memcheck as follows: ``` $ clang++ -O3 test.cpp -c -emit-llvm -S && cat test.ll %scevgep = getelementptr float, float* %d, i64 %wide.trip.count %scevgep22 = getelementptr float, float* %a, i64 %wide.trip.count %scevgep25 = getelementptr float, float* %c, i64 %wide.trip.count %scevgep28 = getelementptr float, float* %b, i64 %wide.trip.count %bound030 = icmp ugt float* %scevgep25, %d %bound131 = icmp ugt float* %scevgep, %c %found.conflict32 = and i1 %bound030, %bound131 ... fadd <4 x float> ... ``` With openmp-simd enabled, clang vectorizes the loop without memcheck. This means that only `simd` directive is enabled, while `ordered simd` directive is disabled. The results are expected. ``` clang++ -fopenmp-simd -O3 test.cpp -c -emit-llvm -S && cat test.ll ``` With openmp enabled, both `simd` and `ordered simd` directives are enabled. Clang frontend generates the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute when optimization level is more than 0. The generated IR is to vectorize the loop with memcheck as follows: ``` $ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S && cat test.ll %scevgep = getelementptr float, float* %d, i64 %wide.trip.count %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count %scevgep32 = getelementptr float, float* %c, i64 %wide.trip.count %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count %bound037 = icmp ugt float* %scevgep32, %d %bound138 = icmp ugt float* %scevgep, %c %found.conflict39 = and i1 %bound037, %bound138 ... fadd <4 x float> ... ``` But the expected IR should be like the following: ``` %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count %found.conflict... ... fadd <4 x float> ... ``` I have two questions here: 1. Does the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute cause the memcheck? And how? 2. If my understanding is correct according to the above analysis, should the codegen of `ordered simd` directive be fixed to support the expected behaviors? And should `memcheck` function (emitMemRuntimeChecks) also support partial check instead of the whole region inside the loop? Also, for the following test case, both of vectorization of `d[i] = c[i] + 1.0;` and `a[i] = a[i-1] + 1.0;` are disabled. ``` void func(float *a, float *b, float *c, float *d, int N) { #pragma omp simd for (int i = 1; i < N; i++) { d[i] = c[i] + 1.0; #pragma omp ordered simd a[i] = a[i-1] + 1.0; } } ``` What is expected is to vectorize the statement `d[i] = c[i] + 1.0;`. I also test icc and gcc and here are the results: ``` $ icc -v icc version 2021.1 $ icc -qopenmp test.cpp -O3 -qopt-report -qopt-report-phase=vec -S && cat test.optrpt LOOP BEGIN at test.cpp(3,3) remark #15531: Block of statements was serialized due to user request [ test.cpp(5,5) ] remark #15301: SIMD LOOP WAS VECTORIZED LOOP END $ g++ -v gcc version 9.3.0 (GCC) $ g++ test.cpp -fopenmp -fdump-tree-all -fdump-rtl-all -O3 -ftree-vectorize -S && cat test.s fadd s0, s0, s1 // not vectorized ... fadd s0, s0, s1 // not vectorized // There is `GOMP_SIMD_ORDERED_START` and `GOMP_SIMD_ORDERED_END` before and after the statement of `a[i] = a[i-1] + 1.0` in ifcvt pass, after which they are used in vect pass to break the vectorization. ``` For the following test case: ``` void func(float *b, float *c, float *d, int N) { float a[N]; for (int i = 0; i < N; i++) a[i] = 0; #pragma omp simd for (int i = 1; i < N; i++) { d[i] = c[i] + 1.0; #pragma omp ordered simd a[i] = a[i-1] + 1.0; } } ``` The IR generated is as follows: ``` $ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S %scevgep = getelementptr float, float* %d, i64 1 %3 = add nuw nsw i64 %wide.trip.count, 1 %scevgep41 = getelementptr float, float* %d, i64 %3 %scevgep43 = getelementptr float, float* %c, i64 1 %scevgep45 = getelementptr float, float* %c, i64 %3 %bound0 = icmp ult float* %scevgep, %scevgep45 %bound1 = icmp ult float* %scevgep43, %scevgep41 %found.conflict = and i1 %bound0, %bound1 %induction = fadd <4 x float> %.splat, <float 0.000000e+00, float 1.000000e+00, float 2.000000e+00, float 3.000000e+00> ``` The result for the statement of `d[i] = c[i] + 1.0` and `a[i] = a[i-1] + 1.0` are both unexpected. It is safe to vectorize the statement of `a[i] = a[i-1] + 1.0` although it violates the definition of ordered construct in OpenMP 5.0 standard. But the memcheck of variables `d` and `c` should not be correct as the `simd` directive is there. All the best, Peixin -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210908/3d91e6fb/attachment.html>
Johannes Doerfert via llvm-dev
2021-Sep-08 13:35 UTC
[llvm-dev] About discussion of vectorization pass and openmp `simd` and `ordered simd` directives
Hi Peixin, First, I think CC'ing a lot of folks is not always the best strategy. This is also a topic for openmp-dev and I'll reply there instead. ~ Johannes On 9/8/21 4:41 AM, qiaopeixin wrote:> Hi, > > I would like to discuss the behaviors of openmp `simd` and `ordered simd` directives. I think current Clang may not give expected results as OpenMP 5.0 standard defines. > > Let's start one c++ example: > ``` > void func(float *a, float *b, float *c, float *d, int N) { > #pragma omp simd > for (int i = 0; i < N; i++) { > d[i] = c[i] + 1.0; > #pragma omp ordered simd > a[i] = b[i] + 1.0; > } > } > ``` > What is expected according to OpenMP 5.0 standard is like the following: > ``` > void func(float *a, float *b, float *c, float *d, int N) { > for (int i = 0; i < N; i += 4) { > #pragma omp simd > for (int j = i; j < 4; j++) > d[i] = c[i] + 1.0; // vectorized > > for (int j = i; j < 4; j++) > a[i] = b[i] + 1.0; // not vectorized > } > } > ``` > It seems that current Clang and LLVM do not support it. > > Without openmp enabled, clang vectorizes the loop with memcheck as follows: > ``` > $ clang++ -O3 test.cpp -c -emit-llvm -S && cat test.ll > %scevgep = getelementptr float, float* %d, i64 %wide.trip.count > %scevgep22 = getelementptr float, float* %a, i64 %wide.trip.count > %scevgep25 = getelementptr float, float* %c, i64 %wide.trip.count > %scevgep28 = getelementptr float, float* %b, i64 %wide.trip.count > %bound030 = icmp ugt float* %scevgep25, %d > %bound131 = icmp ugt float* %scevgep, %c > %found.conflict32 = and i1 %bound030, %bound131 > ... fadd <4 x float> ... > ``` > > With openmp-simd enabled, clang vectorizes the loop without memcheck. This means that only `simd` directive is enabled, while `ordered simd` directive is disabled. The results are expected. > ``` > clang++ -fopenmp-simd -O3 test.cpp -c -emit-llvm -S && cat test.ll > ``` > > With openmp enabled, both `simd` and `ordered simd` directives are enabled. Clang frontend generates the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute when optimization level is more than 0. The generated IR is to vectorize the loop with memcheck as follows: > ``` > $ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S && cat test.ll > %scevgep = getelementptr float, float* %d, i64 %wide.trip.count > %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count > %scevgep32 = getelementptr float, float* %c, i64 %wide.trip.count > %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count > %bound037 = icmp ugt float* %scevgep32, %d > %bound138 = icmp ugt float* %scevgep, %c > %found.conflict39 = and i1 %bound037, %bound138 > ... fadd <4 x float> ... > ``` > But the expected IR should be like the following: > ``` > %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count > %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count > %found.conflict... > ... fadd <4 x float> ... > ``` > I have two questions here: > 1. Does the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute cause the memcheck? And how? > 2. If my understanding is correct according to the above analysis, should the codegen of `ordered simd` directive be fixed to support the expected behaviors? And should `memcheck` function (emitMemRuntimeChecks) also support partial check instead of the whole region inside the loop? > > Also, for the following test case, both of vectorization of `d[i] = c[i] + 1.0;` and `a[i] = a[i-1] + 1.0;` are disabled. > ``` > void func(float *a, float *b, float *c, float *d, int N) { > #pragma omp simd > for (int i = 1; i < N; i++) { > d[i] = c[i] + 1.0; > #pragma omp ordered simd > a[i] = a[i-1] + 1.0; > } > } > ``` > What is expected is to vectorize the statement `d[i] = c[i] + 1.0;`. > I also test icc and gcc and here are the results: > ``` > $ icc -v > icc version 2021.1 > $ icc -qopenmp test.cpp -O3 -qopt-report -qopt-report-phase=vec -S && cat test.optrpt > LOOP BEGIN at test.cpp(3,3) > remark #15531: Block of statements was serialized due to user request [ test.cpp(5,5) ] > remark #15301: SIMD LOOP WAS VECTORIZED > LOOP END > $ g++ -v > gcc version 9.3.0 (GCC) > $ g++ test.cpp -fopenmp -fdump-tree-all -fdump-rtl-all -O3 -ftree-vectorize -S && cat test.s > fadd s0, s0, s1 // not vectorized > ... > fadd s0, s0, s1 // not vectorized > // There is `GOMP_SIMD_ORDERED_START` and `GOMP_SIMD_ORDERED_END` before and after the statement of `a[i] = a[i-1] + 1.0` in ifcvt pass, after which they are used in vect pass to break the vectorization. > ``` > > For the following test case: > ``` > void func(float *b, float *c, float *d, int N) { > float a[N]; > for (int i = 0; i < N; i++) > a[i] = 0; > #pragma omp simd > for (int i = 1; i < N; i++) { > d[i] = c[i] + 1.0; > #pragma omp ordered simd > a[i] = a[i-1] + 1.0; > } > } > ``` > The IR generated is as follows: > ``` > $ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S > %scevgep = getelementptr float, float* %d, i64 1 > %3 = add nuw nsw i64 %wide.trip.count, 1 > %scevgep41 = getelementptr float, float* %d, i64 %3 > %scevgep43 = getelementptr float, float* %c, i64 1 > %scevgep45 = getelementptr float, float* %c, i64 %3 > %bound0 = icmp ult float* %scevgep, %scevgep45 > %bound1 = icmp ult float* %scevgep43, %scevgep41 > %found.conflict = and i1 %bound0, %bound1 > %induction = fadd <4 x float> %.splat, <float 0.000000e+00, float 1.000000e+00, float 2.000000e+00, float 3.000000e+00> > ``` > The result for the statement of `d[i] = c[i] + 1.0` and `a[i] = a[i-1] + 1.0` are both unexpected. It is safe to vectorize the statement of `a[i] = a[i-1] + 1.0` although it violates the definition of ordered construct in OpenMP 5.0 standard. But the memcheck of variables `d` and `c` should not be correct as the `simd` directive is there. > > All the best, > Peixin >