Hi Craig,

Thank you very much for your answer.

I did not mean to discuss the exact semantics and name of one particular operation, but rather to raise the question "would it be beneficial to have more vector builtins?".

You wrote that the compiler will recognize a pattern and replace it with __builtin_ia32_haddps when possible, but how can I be sure of that? I would have to disassemble the generated code, right? That is very impractical, isn't it? It also leads me to understand that each CPU target has a bank of patterns it can recognize; wouldn't it be very similar to have advanced generic vector operations with CPU-specific implementations for those builtins?

Regarding hadd: I agree, the name does not describe very well what it does. And yes, hadd could mean summing all the elements of a vector, but I think the usual terminology for that is reduce_add.

In my case I use it to compute the mono signal from an interleaved stereo signal:

a = load(in);
b = load(in + K);
l = shuffle(a, b, 0, 2, 4, 6, ...); // l and r have the same size as a
r = shuffle(a, b, 1, 3, 5, 7, ...);
m = .5 * (l + r); // m has the same size as a and b, which is maybe optimal for memory I/O?
store(m, out);

As you said, I could make m half the size of a, and then I would not need to load b. Which approach would deliver the best performance? Does the compiler recognize both? Maybe there is another valid approach; would the compiler recognize it?

I would also like to discuss reduce_add: there might be multiple ways of implementing it correctly, but is one of them faster? Is the same approach always the best, or does it depend on the CPU? I believe those questions are best answered by the compiler.

Then a side note regarding the clang documentation: __builtin_shufflevector is not referenced at
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors

Best regards,
Alexandre Bique
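A compilable rendering of the pseudocode above, written against clang's vector extensions. This is only a sketch: the 8-float vector width, the function name downmix8, and the memcpy-based loads and stores are assumptions, not part of the message.

#include <string.h>

typedef float v8sf __attribute__((vector_size(32)));  /* 8 x float */

/* Interleaved stereo input (L0 R0 L1 R1 ...), 8 mono frames per call. */
static void downmix8(const float *in, float *out) {
    v8sf a, b;
    memcpy(&a, in, sizeof(a));        /* a = load(in)            */
    memcpy(&b, in + 8, sizeof(b));    /* b = load(in + K), K = 8 */
    /* l = even elements (left channel), r = odd elements (right channel) */
    v8sf l = __builtin_shufflevector(a, b, 0, 2, 4, 6, 8, 10, 12, 14);
    v8sf r = __builtin_shufflevector(a, b, 1, 3, 5, 7, 9, 11, 13, 15);
    v8sf m = (l + r) * 0.5f;          /* mono = average of the two channels */
    memcpy(out, &m, sizeof(m));       /* store(m, out)           */
}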
On Wed, Aug 19, 2020 at 8:34 PM Craig Topper <craig.topper at gmail.com> wrote:

> I'm not sure everyone would agree that the behavior of a
> __builtin_vector_hadd should do what the X86 instruction does. It takes two
> vectors and produces a result with elements from both vectors. Someone
> might argue that a horizontal add should just take one source and produce a
> vector with half the number of elements. Someone else might argue that a
> horizontal add should sum all the elements to a single scalar value. With
> different implementation choices like that, it's hard to say it should be a
> generic operation when the behavior might only make sense for one target's
> instruction set.
>
> The behavior of the 256-bit vhaddps instruction on X86 is also weird, since
> it treats the upper and lower 128 bits of the sources and destination
> independently. That quirk wouldn't make sense in a generic operation.
>
> You can emulate __builtin_ia32_haddps generically using
> __builtin_shufflevector and the + operator. The X86 backend should
> recognize it and use haddps.
>
> ~Craig
>
> On Wed, Aug 19, 2020 at 10:54 AM Alexandre Bique via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> Hi,
>>
>> I love LLVM vectors, yet I wonder why some advanced vector operations
>> are specific to certain CPU targets.
>>
>> Let me take an example:
>>
>> /// Horizontally adds the adjacent pairs of values contained in two
>> /// 128-bit vectors of [4 x float].
>> ///
>> /// \headerfile <x86intrin.h>
>> ///
>> /// This intrinsic corresponds to the <c> VHADDPS </c> instruction.
>> ///
>> /// \param __a
>> ///    A 128-bit vector of [4 x float] containing one of the source operands.
>> ///    The horizontal sums of the values are stored in the lower bits of the
>> ///    destination.
>> /// \param __b
>> ///    A 128-bit vector of [4 x float] containing one of the source operands.
>> ///    The horizontal sums of the values are stored in the upper bits of the
>> ///    destination.
>> /// \returns A 128-bit vector of [4 x float] containing the horizontal sums of
>> ///    both operands.
>> static __inline__ __m128 __DEFAULT_FN_ATTRS
>> _mm_hadd_ps(__m128 __a, __m128 __b)
>> {
>>   return __builtin_ia32_haddps((__v4sf)__a, (__v4sf)__b);
>> }
>>
>> Here clang will translate _mm_hadd_ps to a CPU-specific feature.
>> Why not create __builtin_vector_hadd(a, b), which would select the
>> CPU-specific instruction or a fallback generic implementation?
>>
>> Many thanks,
>> Alex
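As a concrete illustration of the emulation Craig suggests (__builtin_shufflevector plus the + operator), here is a sketch using clang's vector extensions. The type and function names are made up, and whether the backend actually selects haddps is up to the X86 pattern matcher.

typedef float v4sf __attribute__((vector_size(16)));   /* 4 x float */

/* Same result as _mm_hadd_ps: [a0+a1, a2+a3, b0+b1, b2+b3]. */
static inline v4sf generic_hadd(v4sf a, v4sf b) {
    v4sf even = __builtin_shufflevector(a, b, 0, 2, 4, 6); /* [a0, a2, b0, b2] */
    v4sf odd  = __builtin_shufflevector(a, b, 1, 3, 5, 7); /* [a1, a3, b1, b3] */
    return even + odd;
}

Note that the same even/odd pattern on 8-float vectors does not map to a single 256-bit vhaddps, because, as mentioned above, vhaddps treats the two 128-bit lanes independently.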
__builtin_shufflevector was supposed to be linked at
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors,
but due to a mistake in the source file it's generated from, the link wasn't emitted correctly. I've fixed that, and it should hopefully update in the next day or two.

We have internal intrinsics for reduce_add that are used by the autovectorizers. I could see it making sense to expose those to C as a builtin. For X86, I think we always reduce at each stage by moving the upper half of the vector onto the lower half with a shuffle and then adding it to the lower half. I think on some CPUs we use haddps/haddpd for the last stage, combining element 1 with element 0, but on most CPUs we use a shuffle and an addps/addpd. Intel CPUs use two shuffles and an addps/addpd internally to implement haddps/haddpd, and there is only one execution unit that can do those two shuffles, so they execute serially before the addps/addpd. So for reductions it is better to just emit a single shuffle than to use haddps/haddpd.

~Craig

On Thu, Aug 20, 2020 at 2:17 AM Alexandre Bique <bique.alexandre at gmail.com> wrote:

> I would also like to discuss reduce_add: there might be multiple ways of
> implementing it correctly, but is one of them faster? Is the same approach
> always the best, or does it depend on the CPU?
>
> Then a side note regarding the clang documentation: __builtin_shufflevector
> is not referenced at
> https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
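A sketch of the reduction strategy described above (fold the upper half onto the lower half with a shuffle, add, and repeat), using clang's vector extensions. The names are illustrative and this only shows the generic pattern; which instructions actually get emitted is up to the backend.

typedef float v4sf __attribute__((vector_size(16)));   /* 4 x float */

static inline float reduce_add4(v4sf v) {
    /* Stage 1: move elements {2,3} down and add them to elements {0,1}.
       -1 marks "don't care" lanes for __builtin_shufflevector. */
    v4sf hi = __builtin_shufflevector(v, v, 2, 3, -1, -1);
    v4sf s  = v + hi;                 /* s[0] = v0+v2, s[1] = v1+v3 */
    /* Stage 2: move element 1 down and add it to element 0. */
    v4sf hi1 = __builtin_shufflevector(s, s, 1, -1, -1, -1);
    v4sf s1  = s + hi1;               /* s1[0] = v0+v1+v2+v3 */
    return s1[0];
}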
Thank you very much for the explanation.

I have one more question: in LLVM IR it is possible to call sin() on a vector, yet I did not find how to do it with clang. I've tried various things:

#include <cmath>

using vec = float __attribute__((__vector_size__(4 * 4)));

vec fct(vec a) {
  vec b = std::exp(a);
  //vec b = __builtin_exp(a);
  //vec b{std::exp(a[0]), std::exp(a[1]), std::exp(a[2]), std::exp(a[3])};
  //vec b{__builtin_expf(a[0]), __builtin_expf(a[1]), __builtin_expf(a[2]), __builtin_expf(a[3])};
  return b;
}

Do you know how to do that?

Regards,
Alexandre Bique

On Fri, Aug 21, 2020 at 9:09 PM Craig Topper <craig.topper at gmail.com> wrote:

> We have internal intrinsics for reduce_add that are used by the
> autovectorizers. I could see it making sense to expose those to C as a
> builtin.
>
> So for reductions it is better to just emit a single shuffle than to use
> haddps/haddpd.
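For reference, the element-wise form (the last commented-out attempt in the message above) does compile with clang's vector extensions; a minimal sketch is below. It is written as four scalar expf calls, and whether the optimizer turns that into a single vector math-library call depends on the target and flags (for example -fveclib), so treat it only as a portable fallback rather than the vectorized call being asked about.

using vec = float __attribute__((__vector_size__(4 * 4)));

/* Sketch: apply expf to each lane of a 4 x float vector. */
static inline vec exp_elementwise(vec a) {
    vec b = {__builtin_expf(a[0]), __builtin_expf(a[1]),
             __builtin_expf(a[2]), __builtin_expf(a[3])};
    return b;
}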