Hi Craig,
Thank you very much for your answer.
I did not want to discuss the exact semantics and name of one operation,
but rather to raise the question: "would it be beneficial to have more
vector builtins?"
You wrote that the compiler will recognize a pattern and replace it with
__builtin_ia32_haddps when possible, but how can I be sure of that? I would
have to disassemble the generated code, right? That is very impractical,
isn't it? It also leads me to understand that each CPU target has a bank of
patterns that it can recognize. Wouldn't that be very similar to having
advanced generic vector operations with a CPU-specific implementation of
each builtin?
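For concreteness, here is a minimal sketch of the generic emulation you describe, using __builtin_shufflevector and the + operator. The names v4f and hadd_generic are made up for illustration, and the scalar fallback is only there so the snippet still builds on compilers that lack the builtin. Whether a given backend really selects haddps for this pattern depends on the compiler version and target flags; compiling with something like `clang -O2 -msse3 -S` and reading the output is one way to check.

```c
/* Older compilers may not provide __has_builtin at all. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Generic 128-bit vector of 4 floats (GCC/clang vector extension). */
typedef float v4f __attribute__((vector_size(16)));

/* Hypothetical helper with the same lane layout as the 128-bit haddps:
 * { a0+a1, a2+a3, b0+b1, b2+b3 }. */
static inline v4f hadd_generic(v4f a, v4f b)
{
#if __has_builtin(__builtin_shufflevector)
    /* Indices 0..3 select from a, 4..7 select from b. */
    v4f even = __builtin_shufflevector(a, b, 0, 2, 4, 6);
    v4f odd  = __builtin_shufflevector(a, b, 1, 3, 5, 7);
    return even + odd;
#else
    /* Scalar fallback, element by element. */
    v4f r = { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
    return r;
#endif
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, hadd_generic(a, b) yields {3, 7, 30, 70}.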
Regarding hadd: I agree, the name does not describe very well what it
does. And yes, hadd could mean summing all the vector elements, but I think
that the usual term for that is reduce_add.
In my case I use it to compute the mono signal of an interleaved stereo
signal:
    a = load(in);
    b = load(in + K);
    l = shuffle(a, b, 0, 2, 4, 6, ...); // l and r have the same size as a
    r = shuffle(a, b, 1, 3, 5, 7, ...);
    // m has the same size as a and b, which is maybe optimal for memory I/O?
    m = .5 * (l + r);
    store(m, out);
As you said, I could make m half the size of a, and then I would not need
to load b. Which approach would deliver the best performance? Does the
compiler recognize both? Maybe there is yet another valid approach; would
the compiler recognize it?
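To make the two variants concrete, here is a rough sketch of both with generic vectors. The names v4f, v2f, mono_full and mono_half are invented for this example, loads and stores go through memcpy so that no alignment is assumed, and the scalar fallback is only there for compilers without __builtin_shufflevector. This says nothing about which variant is faster; that would have to be measured per target.

```c
#include <string.h>

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

typedef float v4f __attribute__((vector_size(16)));
typedef float v2f __attribute__((vector_size(8)));

/* Variant 1: load two vectors (8 interleaved floats), store 4 mono floats. */
static inline void mono_full(const float *in, float *out)
{
    v4f a, b;
    memcpy(&a, in, sizeof a);
    memcpy(&b, in + 4, sizeof b);
#if __has_builtin(__builtin_shufflevector)
    v4f l = __builtin_shufflevector(a, b, 0, 2, 4, 6); /* left samples  */
    v4f r = __builtin_shufflevector(a, b, 1, 3, 5, 7); /* right samples */
#else
    v4f l = { a[0], a[2], b[0], b[2] };
    v4f r = { a[1], a[3], b[1], b[3] };
#endif
    v4f half = { 0.5f, 0.5f, 0.5f, 0.5f };
    v4f m = half * (l + r);
    memcpy(out, &m, sizeof m);
}

/* Variant 2: load one vector (4 interleaved floats), store 2 mono floats. */
static inline void mono_half(const float *in, float *out)
{
    v4f a;
    memcpy(&a, in, sizeof a);
#if __has_builtin(__builtin_shufflevector)
    v2f l = __builtin_shufflevector(a, a, 0, 2);
    v2f r = __builtin_shufflevector(a, a, 1, 3);
#else
    v2f l = { a[0], a[2] };
    v2f r = { a[1], a[3] };
#endif
    v2f half = { 0.5f, 0.5f };
    v2f m = half * (l + r);
    memcpy(out, &m, sizeof m);
}
```

One call to mono_full(in, out) produces the same four samples as mono_half(in, out) followed by mono_half(in + 4, out + 2).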
I would also like to discuss reduce_add: there may be multiple correct ways
of doing it, but is one of them faster? Is the same approach always the
best, or does it depend on the CPU? I believe that those questions are best
answered by the compiler.
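As an illustration, one possible way to write reduce_add by hand for a 4-float vector is two shuffle-and-add steps. This is only a sketch: the names v4f and reduce_add4 are hypothetical, the scalar fallback exists only for compilers without __builtin_shufflevector, and I am not claiming this is the fastest form on any CPU. Note that it adds the elements pairwise, so the result can differ in the last bits from a left-to-right scalar sum.

```c
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

typedef float v4f __attribute__((vector_size(16)));

/* Sums the four lanes as (v0 + v2) + (v1 + v3). */
static inline float reduce_add4(v4f v)
{
#if __has_builtin(__builtin_shufflevector)
    /* Step 1: add the upper half onto the lower half. */
    v4f swapped = __builtin_shufflevector(v, v, 2, 3, 0, 1);
    v4f s = v + swapped;            /* { v0+v2, v1+v3, v2+v0, v3+v1 } */
    /* Step 2: add neighbouring lanes. */
    v4f t = __builtin_shufflevector(s, s, 1, 0, 3, 2);
    v4f r = s + t;                  /* every lane holds the full sum */
    return r[0];
#else
    /* Scalar fallback with the same association order. */
    return (v[0] + v[2]) + (v[1] + v[3]);
#endif
}
```

For example, reduce_add4 applied to {1, 2, 3, 4} returns 10.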
One side note regarding the clang documentation: __builtin_shufflevector
is not referenced in this section:
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
Best regards,
Alexandre Bique
On Wed, Aug 19, 2020 at 8:34 PM Craig Topper <craig.topper at gmail.com>
wrote:
> I'm not sure everyone would agree that the behavior of a
> __builtin_vector_hadd should do what the X86 instruction does. It takes two
> vectors and produces a result with elements from both vectors. Someone
> might argue that a horizontal add should just take one source and produce a
> vector with half the number of elements. Someone else might argue that a
> horizontal add should sum all the elements to a single scalar value. With
> different implementation choices like that it's hard to say it should be a
> generic operation when the behavior might only make sense for one target's
> instruction set.
>
> The behavior of the 256-bit vhaddps instruction on X86 is also weird since
> it treats the upper and lower 128-bits of the sources and destination
> independently. That quirk wouldn't make sense in a generic operation.
>
> You can emulate __builtin_ia32_haddps generically using
> __builtin_shufflevector and the + operator. The X86 backend should
> recognize it and use haddps.
>
> ~Craig
>
>
> On Wed, Aug 19, 2020 at 10:54 AM Alexandre Bique via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi,
>>
>> I love llvm vectors, yet I wonder why some advanced vector operations are
>> specific to some CPU targets?
>>
>> Let me take an example:
>>
>> /// Horizontally adds the adjacent pairs of values contained in two
>> ///    128-bit vectors of [4 x float].
>> ///
>> /// \headerfile <x86intrin.h>
>> ///
>> /// This intrinsic corresponds to the <c> VHADDPS </c> instruction.
>> ///
>> /// \param __a
>> ///    A 128-bit vector of [4 x float] containing one of the source
>> ///    operands. The horizontal sums of the values are stored in the
>> ///    lower bits of the destination.
>> /// \param __b
>> ///    A 128-bit vector of [4 x float] containing one of the source
>> ///    operands. The horizontal sums of the values are stored in the
>> ///    upper bits of the destination.
>> /// \returns A 128-bit vector of [4 x float] containing the horizontal
>> ///    sums of both operands.
>> static __inline__ __m128 __DEFAULT_FN_ATTRS
>> _mm_hadd_ps(__m128 __a, __m128 __b)
>> {
>>   return __builtin_ia32_haddps((__v4sf)__a, (__v4sf)__b);
>> }
>>
>> Here clang will translate _mm_hadd_ps to a CPU specific feature.
>> Why not create __builtin_vector_hadd(a, b) which would select the CPU
>> specific instruction or a fallback generic implementation?
>>
>> Many thanks,
>> Alex
>>
>