thr3ads.net - llvm dev - [llvm-dev] About discussion of vectorization pass and openmp `simd` and `ordered simd` directives [Sep 2021]

qiaopeixin via llvm-dev

2021-Sep-08 09:41 UTC

[llvm-dev] About discussion of vectorization pass and openmp `simd` and `ordered simd` directives

Hi,

I would like to discuss the behavior of the OpenMP `simd` and `ordered simd`
directives. I think current Clang may not produce the results that the OpenMP 5.0
standard defines.

Let's start with a C++ example:
```
void func(float *a, float *b, float *c, float *d, int N) {
  #pragma omp simd
  for (int i = 0; i < N; i++) {
    d[i] = c[i] + 1.0;
    #pragma omp ordered simd
    a[i] = b[i] + 1.0;
  }
}
```
What the OpenMP 5.0 standard expects is something like the following (taking a vector length of 4):
```
void func(float *a, float *b, float *c, float *d, int N) {
  for (int i = 0; i < N; i += 4) {
    int end = i + 4 < N ? i + 4 : N; // clamp the last chunk to N
    #pragma omp simd
    for (int j = i; j < end; j++)
      d[j] = c[j] + 1.0; // vectorized

    for (int j = i; j < end; j++)
      a[j] = b[j] + 1.0; // not vectorized
  }
}
```
It seems that current Clang and LLVM do not support it.
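
As a sanity check on the strip-mined form above, here is a minimal sketch that compares it against the original loop on non-aliasing arrays. The names `func_reference`, `func_strip_mined`, `forms_agree` and the constant `kWidth` are hypothetical, the pragmas are left out so that the code builds without OpenMP, and the chunk width of 4 is an assumption taken from the example.

```cpp
#include <algorithm>
#include <vector>

// Original loop from the example, written without pragmas.
void func_reference(float *a, const float *b, const float *c, float *d, int N) {
  for (int i = 0; i < N; i++) {
    d[i] = c[i] + 1.0f;
    a[i] = b[i] + 1.0f;
  }
}

// Assumed vector length; the example above uses 4.
constexpr int kWidth = 4;

// Strip-mined form: one chunk of kWidth iterations at a time.
void func_strip_mined(float *a, const float *b, const float *c, float *d, int N) {
  for (int i = 0; i < N; i += kWidth) {
    const int end = std::min(i + kWidth, N); // clamp the last chunk to N
    for (int j = i; j < end; j++)
      d[j] = c[j] + 1.0f; // candidate for SIMD execution
    for (int j = i; j < end; j++)
      a[j] = b[j] + 1.0f; // kept in iteration order
  }
}

// Hypothetical helper: true when both forms produce identical arrays.
bool forms_agree(int N) {
  std::vector<float> b(N), c(N);
  for (int i = 0; i < N; i++) {
    b[i] = 2.0f * i;
    c[i] = 3.0f * i;
  }
  std::vector<float> a1(N), d1(N), a2(N), d2(N);
  func_reference(a1.data(), b.data(), c.data(), d1.data(), N);
  func_strip_mined(a2.data(), b.data(), c.data(), d2.data(), N);
  return a1 == a2 && d1 == d2;
}
```

The comparison only holds here because the four arrays are distinct; with aliasing pointers the two forms could differ, which is exactly what a runtime memcheck guards against.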

Without OpenMP enabled, Clang vectorizes the loop with a runtime memory check (memcheck), as follows:
```
$ clang++ -O3 test.cpp -c -emit-llvm -S && cat test.ll
  %scevgep = getelementptr float, float* %d, i64 %wide.trip.count
  %scevgep22 = getelementptr float, float* %a, i64 %wide.trip.count
  %scevgep25 = getelementptr float, float* %c, i64 %wide.trip.count
  %scevgep28 = getelementptr float, float* %b, i64 %wide.trip.count
  %bound030 = icmp ugt float* %scevgep25, %d
  %bound131 = icmp ugt float* %scevgep, %c
  %found.conflict32 = and i1 %bound030, %bound131
  ... fadd <4 x float> ...
```
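
For readers less familiar with the IR, the `%bound0` / `%bound1` / `%found.conflict` pattern in the excerpt is a range-overlap test on the pair `d`, `c`. The sketch below restates it in C++; the helper name `found_conflict` is hypothetical, and `std::less` is used only because comparing unrelated pointers with `<` is not guaranteed to be meaningful in C++.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper mirroring %bound0, %bound1 and %found.conflict:
// returns true when [d, d + n) and [c, c + n) overlap.
bool found_conflict(const float *d, const float *c, std::size_t n) {
  std::less<const float *> before;      // well-defined order for any two pointers
  const bool bound0 = before(d, c + n); // end of c's range lies past the start of d
  const bool bound1 = before(c, d + n); // end of d's range lies past the start of c
  return bound0 && bound1;
}
```

When the test reports a conflict, the vectorized loop is skipped and the scalar loop runs instead.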

With `-fopenmp-simd`, Clang vectorizes the loop without memcheck. This means
that only the `simd` directive is honored, while the `ordered simd` directive is
disabled. These results are as expected.
```
clang++ -fopenmp-simd -O3 test.cpp -c -emit-llvm -S && cat test.ll
```

With `-fopenmp`, both the `simd` and `ordered simd` directives are enabled.
The Clang frontend generates the outlined function `captured_stmt(float** %a.addr,
i32* %i3, float** %b.addr)` with the `AlwaysInline` attribute when the optimization
level is above 0. The generated IR vectorizes the loop with memcheck, as
follows:
```
$ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S && cat test.ll
  %scevgep = getelementptr float, float* %d, i64 %wide.trip.count
  %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
  %scevgep32 = getelementptr float, float* %c, i64 %wide.trip.count
  %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
  %bound037 = icmp ugt float* %scevgep32, %d
  %bound138 = icmp ugt float* %scevgep, %c
  %found.conflict39 = and i1 %bound037, %bound138
  ... fadd <4 x float> ...
```
But the expected IR should be like the following:
```
  %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
  %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
  %found.conflict...
  ... fadd <4 x float> ...
```
I have two questions here:
1. Does the outlined function `captured_stmt(float** %a.addr, i32* %i3, float**
%b.addr)` with the `AlwaysInline` attribute cause the memcheck? If so, how?
2. If my understanding in the above analysis is correct, should the
codegen of the `ordered simd` directive be fixed to support the expected behavior?
And should the memcheck function (`emitMemRuntimeChecks`) also support checking
only part of the loop body instead of the whole region inside the loop?

Also, for the following test case, vectorization of both `d[i] = c[i] + 1.0;`
and `a[i] = a[i-1] + 1.0;` is disabled.
```
void func(float *a, float *b, float *c, float *d, int N) {
  #pragma omp simd
  for (int i = 1; i < N; i++) {
    d[i] = c[i] + 1.0;
    #pragma omp ordered simd
    a[i] = a[i-1] + 1.0;
  }
}
```
The expected behavior is to vectorize the statement `d[i] = c[i] + 1.0;`.
I also tested icc and gcc; here are the results:
```
$ icc -v
icc version 2021.1
$ icc -qopenmp test.cpp -O3 -qopt-report -qopt-report-phase=vec -S && cat test.optrpt
LOOP BEGIN at test.cpp(3,3)
   remark #15531: Block of statements was serialized due to user request   [ test.cpp(5,5) ]
   remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END
$ g++ -v
gcc version 9.3.0 (GCC)
$ g++ test.cpp -fopenmp -fdump-tree-all -fdump-rtl-all -O3 -ftree-vectorize -S && cat test.s
fadd    s0, s0, s1 // not vectorized
...
fadd    s0, s0, s1 // not vectorized
// `GOMP_SIMD_ORDERED_START` and `GOMP_SIMD_ORDERED_END` appear before and
// after the statement `a[i] = a[i-1] + 1.0` in the ifcvt pass; the vect pass
// then uses them to break the vectorization.
```
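
To make the loop-carried dependence in `a[i] = a[i-1] + 1.0` concrete, the sketch below contrasts in-order execution with a naive 4-wide execution in which every lane loads before any lane stores. Both function names are hypothetical, and the width of 4 is an assumption; this is a model of the hazard, not of what any of the compilers above emit.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Scalar recurrence: each iteration sees the value written by the previous one.
std::vector<float> run_in_order(std::vector<float> a) {
  for (std::size_t i = 1; i < a.size(); i++)
    a[i] = a[i - 1] + 1.0f;
  return a;
}

// Naive 4-wide model (assumed width): all loads of a chunk happen before
// any of its stores, as in a plain vector load / add / store sequence.
std::vector<float> run_naive_simd(std::vector<float> a) {
  constexpr std::size_t kWidth = 4;
  for (std::size_t i = 1; i < a.size(); i += kWidth) {
    const std::size_t end = std::min(i + kWidth, a.size());
    float lanes[kWidth];
    for (std::size_t j = i; j < end; j++)
      lanes[j - i] = a[j - 1] + 1.0f; // every lane reads stale a[j - 1]
    for (std::size_t j = i; j < end; j++)
      a[j] = lanes[j - i];
  }
  return a;
}
```

Starting from an all-zero array, the two functions disagree from the second element of the first chunk onward, which is why this statement has to stay in iteration order while `d[i] = c[i] + 1.0` can still run as SIMD.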

For the following test case:
```
void func(float *b, float *c, float *d, int N) {
  float a[N];
  for (int i = 0; i < N; i++)
    a[i] = 0;
  #pragma omp simd
  for (int i = 1; i < N; i++) {
    d[i] = c[i] + 1.0;
    #pragma omp ordered simd
    a[i] = a[i-1] + 1.0;
  }
}
```
The IR generated is as follows:
```
$ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S
  %scevgep = getelementptr float, float* %d, i64 1
  %3 = add nuw nsw i64 %wide.trip.count, 1
  %scevgep41 = getelementptr float, float* %d, i64 %3
  %scevgep43 = getelementptr float, float* %c, i64 1
  %scevgep45 = getelementptr float, float* %c, i64 %3
  %bound0 = icmp ult float* %scevgep, %scevgep45
  %bound1 = icmp ult float* %scevgep43, %scevgep41
  %found.conflict = and i1 %bound0, %bound1
  %induction = fadd <4 x float> %.splat, <float 0.000000e+00, float 1.000000e+00, float 2.000000e+00, float 3.000000e+00>
```
The results for both `d[i] = c[i] + 1.0` and `a[i] = a[i-1] + 1.0`
are unexpected. It is safe to vectorize `a[i] = a[i-1] + 1.0` here, although
doing so violates the definition of the ordered construct in the OpenMP 5.0
standard. But the memcheck on `d` and `c` should not be there, since the
`simd` directive is present.
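
One plausible reading of the `%induction` line, offered here as an assumption rather than something the IR excerpt proves, is that the optimizer rewrote the recurrence on the zero-initialized local array in closed form, which removes the dependence between iterations. The sketch below checks that equivalence; both function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Recurrence as written in the test case, with a[] zero-initialized.
std::vector<float> by_recurrence(std::size_t n) {
  std::vector<float> a(n, 0.0f);
  for (std::size_t i = 1; i < n; i++)
    a[i] = a[i - 1] + 1.0f;
  return a;
}

// Closed form: no iteration reads a value written by another iteration.
std::vector<float> by_closed_form(std::size_t n) {
  std::vector<float> a(n, 0.0f);
  for (std::size_t i = 1; i < n; i++)
    a[i] = static_cast<float>(i);
  return a;
}
```

The two agree exactly as long as the indices stay small enough to be represented exactly in `float`.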

All the best,
Peixin

Johannes Doerfert via llvm-dev

2021-Sep-08 13:35 UTC

[llvm-dev] About discussion of vectorization pass and openmp `simd` and `ordered simd` directives

Hi Peixin,

First, I think CC'ing a lot of folks is not always the best strategy.
This is also a topic for openmp-dev and I'll reply there instead.

~ Johannes


