Hi Craig,

Thank you very much for your answer.

I did not mean to discuss the exact semantics and name of one particular operation, but rather to raise the question "would it be beneficial to have more vector builtins?".

You wrote that the compiler will recognize a pattern and replace it with __builtin_ia32_haddps when possible, but how can I be sure of that? I would have to disassemble the generated code, right? That is very impractical, isn't it? It also leads me to understand that each CPU target has a bank of patterns it can recognize; wouldn't it be very similar to have advanced generic vector operations with CPU-specific implementations for those builtins?

Regarding hadd: I agree, the name does not describe very well what it does. And yes, hadd could mean summing all the elements of a vector, but I think the usual terminology for that is reduce_add.

In my case I use it to compute the mono signal from an interleaved stereo signal:

a = load(in);
b = load(in + K);
l = shuffle(a, b, 0, 2, 4, 6, ...); // l and r have the same size as a
r = shuffle(a, b, 1, 3, 5, 7, ...);
m = .5 * (l + r); // m has the same size as a and b, which is maybe optimal for memory I/O?
store(m, out);

As you said, I could make m half the size of a, and then I would not need to load b. Which approach would deliver the best performance? Does the compiler recognize both? Maybe there is another valid approach; would the compiler recognize it?

I would also like to discuss reduce_add: there might be multiple ways of implementing it correctly, but is one of them faster? Is the same approach always the best, or does it depend on the CPU? I believe those questions are best answered by the compiler.

Then a side note regarding the clang documentation: __builtin_shufflevector is not referenced at
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors

Best regards,
Alexandre Bique
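A compilable rendering of the pseudocode above, written against clang's vector extensions. This is only a sketch: the 8-float vector width, the function name downmix8, and the memcpy-based loads and stores are assumptions, not part of the message.

#include <string.h>

typedef float v8sf __attribute__((vector_size(32)));  /* 8 x float */

/* Interleaved stereo input (L0 R0 L1 R1 ...), 8 mono frames per call. */
static void downmix8(const float *in, float *out) {
    v8sf a, b;
    memcpy(&a, in, sizeof(a));        /* a = load(in)            */
    memcpy(&b, in + 8, sizeof(b));    /* b = load(in + K), K = 8 */
    /* l = even elements (left channel), r = odd elements (right channel) */
    v8sf l = __builtin_shufflevector(a, b, 0, 2, 4, 6, 8, 10, 12, 14);
    v8sf r = __builtin_shufflevector(a, b, 1, 3, 5, 7, 9, 11, 13, 15);
    v8sf m = (l + r) * 0.5f;          /* mono = average of the two channels */
    memcpy(out, &m, sizeof(m));       /* store(m, out)           */
}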
On Wed, Aug 19, 2020 at 8:34 PM Craig Topper <craig.topper at gmail.com> wrote:

> I'm not sure everyone would agree that the behavior of a
> __builtin_vector_hadd should do what the X86 instruction does. It takes two
> vectors and produces a result with elements from both vectors. Someone
> might argue that a horizontal add should just take one source and produce a
> vector with half the number of elements. Someone else might argue that a
> horizontal add should sum all the elements to a single scalar value. With
> different implementation choices like that, it's hard to say it should be a
> generic operation when the behavior might only make sense for one target's
> instruction set.
>
> The behavior of the 256-bit vhaddps instruction on X86 is also weird, since
> it treats the upper and lower 128 bits of the sources and destination
> independently. That quirk wouldn't make sense in a generic operation.
>
> You can emulate __builtin_ia32_haddps generically using
> __builtin_shufflevector and the + operator. The X86 backend should
> recognize it and use haddps.
>
> ~Craig
>
> On Wed, Aug 19, 2020 at 10:54 AM Alexandre Bique via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> Hi,
>>
>> I love LLVM vectors, yet I wonder why some advanced vector operations
>> are specific to certain CPU targets.
>>
>> Let me take an example:
>>
>> /// Horizontally adds the adjacent pairs of values contained in two
>> /// 128-bit vectors of [4 x float].
>> ///
>> /// \headerfile <x86intrin.h>
>> ///
>> /// This intrinsic corresponds to the <c> VHADDPS </c> instruction.
>> ///
>> /// \param __a
>> ///    A 128-bit vector of [4 x float] containing one of the source operands.
>> ///    The horizontal sums of the values are stored in the lower bits of the
>> ///    destination.
>> /// \param __b
>> ///    A 128-bit vector of [4 x float] containing one of the source operands.
>> ///    The horizontal sums of the values are stored in the upper bits of the
>> ///    destination.
>> /// \returns A 128-bit vector of [4 x float] containing the horizontal sums of
>> ///    both operands.
>> static __inline__ __m128 __DEFAULT_FN_ATTRS
>> _mm_hadd_ps(__m128 __a, __m128 __b)
>> {
>>   return __builtin_ia32_haddps((__v4sf)__a, (__v4sf)__b);
>> }
>>
>> Here clang will translate _mm_hadd_ps to a CPU-specific feature.
>> Why not create __builtin_vector_hadd(a, b), which would select the
>> CPU-specific instruction or a fallback generic implementation?
>>
>> Many thanks,
>> Alex
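As a concrete illustration of the emulation Craig suggests (__builtin_shufflevector plus the + operator), here is a sketch using clang's vector extensions. The type and function names are made up, and whether the backend actually selects haddps is up to the X86 pattern matcher.

typedef float v4sf __attribute__((vector_size(16)));   /* 4 x float */

/* Same result as _mm_hadd_ps: [a0+a1, a2+a3, b0+b1, b2+b3]. */
static inline v4sf generic_hadd(v4sf a, v4sf b) {
    v4sf even = __builtin_shufflevector(a, b, 0, 2, 4, 6); /* [a0, a2, b0, b2] */
    v4sf odd  = __builtin_shufflevector(a, b, 1, 3, 5, 7); /* [a1, a3, b1, b3] */
    return even + odd;
}

Note that the same even/odd pattern on 8-float vectors does not map to a single 256-bit vhaddps, because, as mentioned above, vhaddps treats the two 128-bit lanes independently.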
__builtin_shufflevector was supposed to be linked at
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors,
but due to a mistake in the source file it's generated from, the link wasn't emitted correctly. I've fixed that, and it should hopefully update in the next day or two.

We have internal intrinsics for reduce_add that are used by the autovectorizers. I could see it making sense to expose those to C as a builtin. For X86, I think we always reduce at each stage by moving the upper half of the vector onto the lower half with a shuffle and then adding it to the lower half. I think on some CPUs we use haddps/haddpd for the last stage, combining element 1 with element 0, but on most CPUs we use a shuffle and an addps/addpd. Intel CPUs use two shuffles and an addps/addpd internally to implement haddps/haddpd, and there is only one execution unit that can do those two shuffles, so they execute serially before the addps/addpd. So for reductions it is better to just emit a single shuffle than to use haddps/haddpd.

~Craig

On Thu, Aug 20, 2020 at 2:17 AM Alexandre Bique <bique.alexandre at gmail.com> wrote:

> I would also like to discuss reduce_add: there might be multiple ways of
> implementing it correctly, but is one of them faster? Is the same approach
> always the best, or does it depend on the CPU?
>
> Then a side note regarding the clang documentation: __builtin_shufflevector
> is not referenced at
> https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
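A sketch of the reduction strategy described above (fold the upper half onto the lower half with a shuffle, add, and repeat), using clang's vector extensions. The names are illustrative and this only shows the generic pattern; which instructions actually get emitted is up to the backend.

typedef float v4sf __attribute__((vector_size(16)));   /* 4 x float */

static inline float reduce_add4(v4sf v) {
    /* Stage 1: move elements {2,3} down and add them to elements {0,1}.
       -1 marks "don't care" lanes for __builtin_shufflevector. */
    v4sf hi = __builtin_shufflevector(v, v, 2, 3, -1, -1);
    v4sf s  = v + hi;                 /* s[0] = v0+v2, s[1] = v1+v3 */
    /* Stage 2: move element 1 down and add it to element 0. */
    v4sf hi1 = __builtin_shufflevector(s, s, 1, -1, -1, -1);
    v4sf s1  = s + hi1;               /* s1[0] = v0+v1+v2+v3 */
    return s1[0];
}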
Thank you very much for the explanation.

I have one more question: in LLVM IR it is possible to call sin() on a vector, yet I did not find how to do it with clang. I've tried various things:

#include <cmath>

using vec = float __attribute__((__vector_size__(4 * 4)));

vec fct(vec a) {
  vec b = std::exp(a);
  //vec b = __builtin_exp(a);
  //vec b{std::exp(a[0]), std::exp(a[1]), std::exp(a[2]), std::exp(a[3])};
  //vec b{__builtin_expf(a[0]), __builtin_expf(a[1]), __builtin_expf(a[2]), __builtin_expf(a[3])};
  return b;
}

Do you know how to do that?

Regards,
Alexandre Bique

On Fri, Aug 21, 2020 at 9:09 PM Craig Topper <craig.topper at gmail.com> wrote:

> We have internal intrinsics for reduce_add that are used by the
> autovectorizers. I could see it making sense to expose those to C as a
> builtin.
>
> So for reductions it is better to just emit a single shuffle than to use
> haddps/haddpd.
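For reference, the element-wise form (the last commented-out attempt in the message above) does compile with clang's vector extensions; a minimal sketch is below. It is written as four scalar expf calls, and whether the optimizer turns that into a single vector math-library call depends on the target and flags (for example -fveclib), so treat it only as a portable fallback rather than the vectorized call being asked about.

using vec = float __attribute__((__vector_size__(4 * 4)));

/* Sketch: apply expf to each lane of a 4 x float vector. */
static inline vec exp_elementwise(vec a) {
    vec b = {__builtin_expf(a[0]), __builtin_expf(a[1]),
             __builtin_expf(a[2]), __builtin_expf(a[3])};
    return b;
}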