Displaying 20 results from an estimated 1000 matches similar to: "[LoopVectorizer] Improving the performance of dot product reduction loop"
2018 Jul 23
4
[LoopVectorizer] Improving the performance of dot product reduction loop
~Craig
On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 07/23/2018 05:22 PM, Craig Topper wrote:
>
> Hello all,
>
> This code https://godbolt.org/g/tTyxpf is a dot product reduction loop
> multiplying sign-extended 16-bit values to produce a 32-bit accumulated
> result. The x86 backend is currently not able to optimize it as well as gcc
2018 Jul 23
2
[LoopVectorizer] Improving the performance of dot product reduction loop
On 07/23/2018 06:23 PM, Hal Finkel via llvm-dev wrote:
>
> On 07/23/2018 05:22 PM, Craig Topper wrote:
>> Hello all,
>>
>> This code https://godbolt.org/g/tTyxpf is a dot product reduction
>> loop multiplying sign-extended 16-bit values to produce a 32-bit
>> accumulated result. The x86 backend is currently not able to optimize
>> it as well as gcc and icc.
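For illustration, a minimal C loop of the kind being described (the exact source is behind the godbolt link above; the function and variable names here are only a sketch):

int dot(const short *a, const short *b, int n) {
  int sum = 0;                       /* 32-bit accumulator */
  for (int i = 0; i < n; ++i)
    sum += (int)a[i] * (int)b[i];    /* sign-extend i16 -> i32, multiply, accumulate */
  return sum;
}

gcc and icc can lower a loop like this to pmaddwd-based code, which appears to be the codegen quality the thread is aiming for.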
2018 Jul 24
4
[LoopVectorizer] Improving the performance of dot product reduction loop
On Tue, Jul 24, 2018 at 6:10 AM Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 07/23/2018 06:37 PM, Craig Topper wrote:
>
>
> ~Craig
>
>
> On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <hfinkel at anl.gov> wrote:
>
>>
>> On 07/23/2018 05:22 PM, Craig Topper wrote:
>>
>> Hello all,
>>
>> This code https://godbolt.org/g/tTyxpf
2018 Jul 24
2
[LoopVectorizer] Improving the performance of dot product reduction loop
On 07/24/2018 02:58 AM, Nema, Ashutosh wrote:
>
> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Tuesday, July 24, 2018 5:05 AM
> *To:* Craig Topper <craig.topper at gmail.com>; hideki.saito at intel.com;
> estotzer at ti.com; Nemanja Ivanovic <nemanja.i.ibm at gmail.com>; Adam
> Nemet <anemet at apple.com>; graham.hunter at
2017 Feb 18
2
Vector trunc code generation difference between llvm-3.9 and 4.0
Thanks Sanjay. Interestingly for me, disable-llvm-optzns did not make a
difference in the way the shift was handled. Does the initial IR generated
for you show this difference when the option is passed?
Best regards
Saurabh
On 17 February 2017 at 19:03, Sanjay Patel <spatel at rotateright.com> wrote:
> I think this is caused by a front-end change (cc'ing clang-dev) because
>
2017 Mar 08
2
Vector trunc code generation difference between llvm-3.9 and 4.0
The regression for the reported case should be avoided after:
https://reviews.llvm.org/rL297232
https://reviews.llvm.org/rL297242
https://reviews.llvm.org/rL297280
It would still be good to understand if the clang change was intentional or
if that was a side effect that can be limited.
On Sat, Feb 18, 2017 at 9:11 AM, Sanjay Patel <spatel at rotateright.com>
wrote:
> Yes, there is an
2017 Feb 17
2
Vector trunc code generation difference between llvm-3.9 and 4.0
Correction in the C snippet:
typedef signed short v8i16_t __attribute__((ext_vector_type(8)));
v8i16_t foo (v8i16_t a, int n)
{
    return a >> n;
}
Best regards
Saurabh
On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com>
wrote:
> Hello,
>
> We are investigating a difference in code generation for vector splat
> instructions between llvm-3.9
2015 Aug 31
2
MCRegisterClass mandatory vs preferred alignment?
On 08/31/2015 03:59 PM, Matthias Braun wrote:
> Looks to me like the alignment is specified in tablegen. From Target.td:
>
> class RegisterClass<string namespace, list<ValueType> regTypes, int alignment,
> dag regList, RegAltNameIndex idx = NoRegAltName>
>
> X86RegisterInfo.td:
>
> def VR256 : RegisterClass<"X86", [v32i8,
2011 Sep 01
0
[LLVMdev] AVX spill alignment
On Aug 25, 2011, at 4:17 PM, Cameron McInally wrote:
> Hey guys,
>
> Are spills/reloads of AVX registers using aligned stores/loads?
Yes.
> I can't
> seem to find the code that aligns the stack slots to 32 bytes. Could
> someone point me in the right direction?
The register class has 256-bit spill alignment:
def VR256 : RegisterClass<"X86", [v32i8, v16i16,
2018 Jun 07
2
Matching ConstantFPSDNode tablegen
I'm trying to match a ConstantFPSDNode == 0 in a DAG pattern in TableGen, but
am having some issues.
LLVM doesn't seem to accept a floating-point constant literal match like:
%v = call <4 x float> @foo(i32 15, float %s, float 0.0, <8 x i32> %rsrc, <4 x i32> %samp, i1 0, i32 0, i32 0)
ret <4 x float> %v
def : XXXPat<(v4f32 (int_foo i32:$mask, f32:$s, 0,
2015 Aug 31
3
MCRegisterClass mandatory vs preferred alignment?
Looking around today, it appears that TargetRegisterClass and
MCRegisterClass only include a single alignment. This is documented as
being the minimum legal alignment, but in practice it often appears to be
greater than this. For instance, on x86 the alignment of %ymm0 is listed
as 32, not 1. Does anyone know why this is?
Additionally, where are these alignments actually defined? I
2016 Feb 15
5
Masked intrinsics and non-default address spaces
Masked load/store are overloaded intrinsics; the only generic type is the type of the value being loaded/stored. The signature of the intrinsic is generated based on this type. The type of the pointer argument is generated as a pointer to the return type with the default addrspace. E.g.:
declare <8 x i32> @llvm.masked.load.v8i32(<8 x i32>*, i32, <8 x i1>, <8 x i32>)
The
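As context for where these intrinsics come from, a C sketch (function and variable names are illustrative) of a predicated loop that the vectorizer can turn into llvm.masked.load/llvm.masked.store on targets that support them:

void masked_update(int *restrict out, const int *restrict in,
                   const int *restrict pred, int n) {
  for (int i = 0; i < n; ++i)
    if (pred[i])                 /* predicated access -> masked load/store */
      out[i] = in[i] + 1;
}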
2011 Aug 25
2
[LLVMdev] AVX spill alignment
Hey guys,
Are spills/reloads of AVX registers using aligned stores/loads? I can't
seem to find the code that aligns the stack slots to 32 bytes. Could
someone point me in the right direction?
Thanks,
Cameron
2009 Dec 10
2
[LLVMdev] SplitVecRes with SIGN_EXTEND_INREG unsupported
I have code that is generating a sign-extend-in-reg on a v8i32, but the
backend does not support this data type. This then asserts in
LegalizeVectorTypes.cpp:389 because there is no function to split this
vector into smaller sizes. Would a correct solution be to add this case
so as to trigger the SplitVecRes_BinaryOp function?
This asserts on both my backend and x86, and TOT does not seem to have
2016 Apr 11
2
X86 TRUNCATE cost for AVX & AVX2 mode
Hi,
I was going through X86TTIImpl::getCastInstrCost and have a question about the cost
calculation for the TRUNCATE instruction in AVX mode.
In the AVX2ConversionTbl & AVXConversionTbl tables there is no cost defined for
TRUNCATE of v16i32 to v16i8, so as a fallback it goes to the SSE41ConversionTbl table,
where it finds a cost of 30 for this operation. A cost of 30 looks very high.
Wondering why
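For reference, a C loop (names are illustrative) whose vectorization at VF=16 needs exactly this v16i32-to-v16i8 TRUNCATE and would therefore hit the cost entry in question:

void trunc_i32_to_i8(const int *src, signed char *dst, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = (signed char)src[i];   /* i32 -> i8 truncation per element */
}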
2018 Jul 24
2
KNL Vectorization with larger vector width
Thank You.
Right now, to see the effect, I made the following changes:
unsigned X86TTIImpl::getRegisterBitWidth(bool Vector) {
  if (Vector) {
    if (ST->hasAVX512())
      return 65536;
Here I changed 512 to 65536. Then in LoopVectorize.cpp I did the following:
assert(MaxVectorSize <= 2048 && "Did not expect to pack so many elements"
       " into
2009 Dec 10
0
[LLVMdev] SplitVecRes with SIGN_EXTEND_INREG unsupported
On Wed, Dec 9, 2009 at 8:40 PM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
> I have code that is generating sign extend in reg on a v8i32, but the
> backend does not support this data type. This then asserts in
> LegalizeVectorTypes.cpp:389 because there is no function to split this
> vector into smaller sizes. Would a correct solution be to add this case so
> to
2017 Jun 25
2
AVX Scheduling and Parallelism
Hi Ahmed,
From what can be seen in the code snippet you provided, the reuse of XMM0 and XMM1 across loop-unroll instances does not inhibit instruction-level parallelism.
Modern X86 processors use register renaming that can eliminate the dependencies in the instruction stream. In the example you provided, the processor should be able to identify the 2-vloads + vadd + vstore sequences as
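For reference, a C sketch (names are illustrative, not taken from the original post) of the kind of loop that produces those 2-load + add + store sequences; each unrolled copy reuses the same architectural registers yet is data-independent, which is exactly what register renaming exploits:

void vadd(int *restrict a, const int *restrict b, const int *restrict c, int n) {
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + c[i];   /* two vector loads, one vector add, one vector store */
}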
2017 Jun 25
0
AVX Scheduling and Parallelism
Hi, Zvi,
I agree. In the context of targeting the KNL, however, I'm a bit
concerned about the addressing, and specifically, the size of the
resulting encoding:
> vmovdqu32 zmm0, zmmword ptr [rax + c+401280]      ; load b[401280] in zmm0
>
> vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344]    ; zmm1 <- zmm1+b[401344]
The KNL can only
2017 Jun 24
4
AVX Scheduling and Parallelism
Hello,
After generating AVX code for a large number of iterations, I came to realize that
it still uses only 2 registers, zmm0 and zmm1, when the loop unroll
factor = 1024.
I wonder if this register allocation allows operations in parallel?
Also, I know all the elements within a single vector instruction are
computed in parallel, but are the elements of multiple instructions
computed in parallel? Like, are