thr3ads.net - llvm dev - [llvm-dev] Question about generated code for x86 vpgather* intrinsics [Dec 2017]

If this information is useful, please help other people find it:
Share via:

Jackson Davis via llvm-dev

2017-Dec-19 00:52 UTC

[llvm-dev] Question about generated code for x86 vpgather* intrinsics

Hi,

I've been looking into some basic vectorized code using
 _mm256_i32gather_epi32 for vpgatherdd. In a basic function I've been
testing, I'm a little confused that it zeroes out the result before the
gather (example: https://godbolt.org/g/zQzn56). Shouldn't this be
unnecessary when the mask is not specified by the intrinsic (and therefor
set for every element), in which case the resulting ymm register will be
fully loaded? The IR seems to specify a pre-zeroed result if I'm
understanding things correctly:


%8 = tail call <8 x i32> @llvm.x86.avx2.gather.d.d.256(<8 x i32>
zeroinitializer, i8* %1, <8 x i32> %7, <8 x i32> <i32 -1, i32 -1,
i32 -1,
i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>, i8 1)

Am I correct in thinking this initialization is unnecessary here given than
the mask is all -1? I'm not really concerned with the cost of the
initialization as much as I am concerned that I don't fully understand the
semantics of the instruction/intrinsic :)

Thanks,
-Jackson
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20171218/e1e2ee30/attachment.html>

Craig Topper via llvm-dev

2017-Dec-19 23:24 UTC

head link

[llvm-dev] Question about generated code for x86 vpgather* intrinsics

You are correct that the initialization is unnecessary for the calculation
of the result of the instruction.

It's zero in the IR because the intrinsic header file uses
_mm256_undefined_si256() which is defined as zero for other reasons. See
llvm.org/PR32176. We don't currently have a convenient way to put undef in
the IR from C code.

#define _mm256_i32gather_epi32(m, i, s) __extension__ ({ \
  (__m256i)__builtin_ia32_gatherd_d256((__v8si)_mm256_undefined_si256(), \
                                       (int const *)(m),
(__v8si)(__m256i)(i), \
                                       (__v8si)_mm256_set1_epi32(-1), (s));
})

Now the backend could detect that mask is all 1s and remove the zero. But
the backend knows one additional thing about the gather instructions. Even
though the mask is all ones, the scheduler in the CPU doesn't know that
until the gather instruction executes. That means the scheduler has to
conservatively assume that the passthru input may be used by the
instruction so the gather can't execute until the last writer of whatever
register is chosen has executed and produced its result. Even though that
result isn't going to be used by the gather. This is a false scheduling
dependency. To break the dependency we emit an explicit zeroing with an xor
which has special treatment in the CPU. The xor result will be considered
ready without ever executing and the gather won't wait for it.

The backend will replace any non-zero or undef value with zero when it can
prove the mask is all ones.

We could be smarter and try to find a register that hasn't been written in
a while and use the zeroing as a last resort, but that's harder. We should
maybe not emit the zero with -Os or -Oz either, but no one has complained
about that yet.

~Craig

On Mon, Dec 18, 2017 at 4:52 PM, Jackson Davis via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hi,
>
> I've been looking into some basic vectorized code using
>  _mm256_i32gather_epi32 for vpgatherdd. In a basic function I've been
> testing, I'm a little confused that it zeroes out the result before the
> gather (example: https://godbolt.org/g/zQzn56). Shouldn't this be
> unnecessary when the mask is not specified by the intrinsic (and therefor
> set for every element), in which case the resulting ymm register will be
> fully loaded? The IR seems to specify a pre-zeroed result if I'm
> understanding things correctly:
>
>
> %8 = tail call <8 x i32> @llvm.x86.avx2.gather.d.d.256(<8 x
i32>
> zeroinitializer, i8* %1, <8 x i32> %7, <8 x i32> <i32 -1,
i32 -1, i32 -1,
> i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>, i8 1)
>
> Am I correct in thinking this initialization is unnecessary here given
> than the mask is all -1? I'm not really concerned with the cost of the
> initialization as much as I am concerned that I don't fully understand
the
> semantics of the instruction/intrinsic :)
>
> Thanks,
> -Jackson
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20171219/db7dd034/attachment.html>

llvm dev - Dec 2017 - Question about generated code for x86 vpgather* intrinsics

[llvm-dev] Question about generated code for x86 vpgather* intrinsics

[llvm-dev] Question about generated code for x86 vpgather* intrinsics