Displaying 9 results from an estimated 9 matches for "__builtin_shufflevector".
2020 Aug 20
2
Question about llvm vectors
...would also like to discuss reduce_add: there may be multiple correct ways of
doing it, but is one of them faster? Is the same approach always
the best, or does it depend on the CPU? I believe those questions are best
answered by the compiler.
A side note regarding the clang documentation: __builtin_shufflevector
is not referenced there:
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
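For illustration only, here is a minimal sketch of one way to write reduce_add for a 4-float vector with __builtin_shufflevector. The names v4f, reduce_add_v4f and HAVE_SHUFFLEVECTOR are made up for this example, and the scalar fallback is only there for compilers that lack the builtin. Whether this shuffle sequence is the fastest choice on a given CPU is exactly the open question above.

```c
/* Generic 4 x float vector type (GCC/Clang vector extension). */
typedef float v4f __attribute__((vector_size(16)));

/* Detect the builtin where the compiler lets us ask for it. */
#if defined(__has_builtin)
#if __has_builtin(__builtin_shufflevector)
#define HAVE_SHUFFLEVECTOR 1
#endif
#endif

/* Hypothetical helper: sums the four lanes of v. */
static float reduce_add_v4f(v4f v) {
#ifdef HAVE_SHUFFLEVECTOR
    /* Swap the two halves ({v2, v3, v0, v1}) and add lane-wise. */
    v4f t = v + __builtin_shufflevector(v, v, 2, 3, 0, 1);
    /* Swap neighbouring lanes ({t1, t0, t3, t2}) and add again;
       every lane now holds the full sum. */
    t = t + __builtin_shufflevector(t, t, 1, 0, 3, 2);
    return t[0];
#else
    /* Assumed fallback: plain scalar sum over the lanes. */
    return v[0] + v[1] + v[2] + v[3];
#endif
}
```

Note that the shuffle version adds the lanes pairwise, so for floating-point inputs the rounding can differ from a strict left-to-right sum.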
Best regards,
Alexandre Bique
On Wed, Aug 19, 2020 at 8:34 PM Craig Topper <craig.topper at gmail.com> wrote:
> I'm not sure everyone would agree that the behavior of a
> __b...
2020 Aug 19
2
Question about llvm vectors
Hi,
I love llvm vectors, yet I wonder why some advanced vector operations are
specific to some CPU targets?
Let me take an example:
/// Horizontally adds the adjacent pairs of values contained in two
/// 128-bit vectors of [4 x float].
///
/// \headerfile <x86intrin.h>
///
/// This intrinsic corresponds to the <c> VHADDPS </c> instruction.
///
/// \param __a
/// A
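As a rough sketch of how the horizontal add described in that comment could be written without a target-specific intrinsic, the version below uses generic vector types and __builtin_shufflevector. The names v4f, hadd_v4f and HAVE_SHUFFLEVECTOR are assumptions made for this example, and nothing here claims that the compiler will select VHADDPS for it.

```c
/* Generic 4 x float vector type (GCC/Clang vector extension). */
typedef float v4f __attribute__((vector_size(16)));

/* Detect the builtin where the compiler lets us ask for it. */
#if defined(__has_builtin)
#if __has_builtin(__builtin_shufflevector)
#define HAVE_SHUFFLEVECTOR 1
#endif
#endif

/* Hypothetical helper: returns {a0+a1, a2+a3, b0+b1, b2+b3}. */
static v4f hadd_v4f(v4f a, v4f b) {
#ifdef HAVE_SHUFFLEVECTOR
    /* Indices 0..3 select lanes of a, indices 4..7 select lanes of b. */
    v4f even = __builtin_shufflevector(a, b, 0, 2, 4, 6); /* a0 a2 b0 b2 */
    v4f odd  = __builtin_shufflevector(a, b, 1, 3, 5, 7); /* a1 a3 b1 b3 */
    return even + odd;
#else
    /* Assumed fallback: build the result lane by lane. */
    v4f r = { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
    return r;
#endif
}
```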
2017 Sep 13
2
RFC phantom memory intrinsic
...te? Only one offset does not seem enough to handle generic cases.
Yes, correct, this slightly modified example does not work.
#include <x86intrin.h>
__m256d vsht_d4_fold(const double* ptr, unsigned long long i) {
    __m256d foo = (__m256d){ ptr[i], ptr[i+1], ptr[i+2], ptr[i+3] };
    return __builtin_shufflevector( foo, foo, 3, 3, 2, 2 );
}
But the aggregate case adds a new level of complexity; should we
care about it? There might be some logic that would probably be marked
as dead by InstCombine, and we don't want to keep it.
BTW: It looks like SLP could not recognize this case either:
define <4 x do...
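To make the shuffle mask in vsht_d4_fold concrete, here is a small sketch that computes the same lanes with a generic vector type and writes them to a plain array, so it does not need AVX or <x86intrin.h>. The names v4d, vsht_d4_fold_model and HAVE_SHUFFLEVECTOR are assumptions made for this example. The mask 3, 3, 2, 2 reads only lanes 2 and 3 of foo, so the values loaded from ptr[i] and ptr[i+1] never reach the result.

```c
#include <string.h>

/* Generic 4 x double vector type (GCC/Clang vector extension). */
typedef double v4d __attribute__((vector_size(32)));

/* Detect the builtin where the compiler lets us ask for it. */
#if defined(__has_builtin)
#if __has_builtin(__builtin_shufflevector)
#define HAVE_SHUFFLEVECTOR 1
#endif
#endif

/* Hypothetical model of vsht_d4_fold: stores the shuffled lanes in out. */
static void vsht_d4_fold_model(const double *ptr, unsigned long long i,
                               double out[4]) {
    v4d foo = { ptr[i], ptr[i + 1], ptr[i + 2], ptr[i + 3] };
#ifdef HAVE_SHUFFLEVECTOR
    /* Result lanes are foo[3], foo[3], foo[2], foo[2]. */
    v4d r = __builtin_shufflevector(foo, foo, 3, 3, 2, 2);
#else
    /* Assumed fallback: pick the same lanes by subscript. */
    v4d r = { foo[3], foo[3], foo[2], foo[2] };
#endif
    memcpy(out, &r, sizeof r);
}
```

With the six values 0, 10, 20, 30, 40, 50 in an array and i = 1, foo is {10, 20, 30, 40} and out receives {40, 40, 30, 30}.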
2017 Sep 13
2
RFC phantom memory intrinsic
...ric cases.
>> Yes, correct, this slightly modified example does not work.
>> #include <x86intrin.h>
>>
>> __m256d vsht_d4_fold(const double* ptr, unsigned long long i) {
>>     __m256d foo = (__m256d){ ptr[i], ptr[i+1], ptr[i+2], ptr[i+3] };
>>     return __builtin_shufflevector( foo, foo, 3, 3, 2, 2 );
>> }
>> But the aggregate case adds a new level of complexity; should we
>> care about it? There might be some logic that would probably be marked
>> as dead by InstCombine, and we don't want to keep it.
>> BTW: Looks like SLP could not...
2017 Sep 26
0
RFC phantom memory intrinsic
...s, correct, this slightly modified example does not work.
>>> #include <x86intrin.h>
>>>
>>> __m256d vsht_d4_fold(const double* ptr, unsigned long long i) {
>>>     __m256d foo = (__m256d){ ptr[i], ptr[i+1], ptr[i+2], ptr[i+3] };
>>>     return __builtin_shufflevector( foo, foo, 3, 3, 2, 2 );
>>> }
>>> But the aggregate case adds a new level of complexity; should we
>>> care about it? There might be some logic that would probably be marked
>>> as dead by InstCombine, and we don't want to keep it.
>>> BTW: Looks...
2014 Sep 23
2
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Sun, Sep 21, 2014 at 1:15 PM, Simon Pilgrim <llvm-dev at redking.me.uk>
wrote:
> On 20 Sep 2014, at 19:44, Chandler Carruth <chandlerc at google.com> wrote:
>
> > If AVX is available I would expect the vpermilps/vpermilpd instruction
> to be used for all float/double single vector shuffles, especially as it
> can deal with the folded load case as well - this would
2017 Sep 26
2
RFC phantom memory intrinsic
...ittle bit changed example is not working.
>>>> #include <x86intrin.h>
>>>>
>>>> __m256d vsht_d4_fold(const double* ptr, unsigned long long i) {
>>>>     __m256d foo = (__m256d){ ptr[i], ptr[i+1], ptr[i+2], ptr[i+3] };
>>>>     return __builtin_shufflevector( foo, foo, 3, 3, 2, 2 );
>>>> }
>>>> But the aggregate case adds a new level of complexity; should we
>>>> care about it? There might be some logic that would probably be marked
>>>> as dead by InstCombine, and we don't want to keep it.
>>...
2017 Sep 26
0
RFC phantom memory intrinsic
...ple is not working.
>>>>> #include <x86intrin.h>
>>>>>
>>>>> __m256d vsht_d4_fold(const double* ptr, unsigned long long i) {
>>>>>     __m256d foo = (__m256d){ ptr[i], ptr[i+1], ptr[i+2], ptr[i+3] };
>>>>>     return __builtin_shufflevector( foo, foo, 3, 3, 2, 2 );
>>>>> }
>>>>> But the aggregate case adds a new level of complexity; should we
>>>>> care about it? There might be some logic that would probably be marked
>>>>> as dead by InstCombine, and we don't want to...
2017 Sep 12
3
RFC phantom memory intrinsic
Hi,
For the PR21780 solution, I plan to add new functionality to restore
memory operations that were once deleted. In this particular case, it is
the load operations that were deleted by InstCombine. Please note that
once a load has been removed there is no way to restore it, and that
prevents us from vectorizing the shuffle operation. There are probably
more similar issues where this approach could