thr3ads.net - search: "v8i32"

Displaying 20 results from an estimated 31 matches for "v8i32".

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 23

Hello all, This code https://godbolt.org/g/tTyxpf is a dot product reduction loop multiplying sign extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes that these are addition reductions of multiplication so we use the vpmaddwd instruction which calculates 32-bit products from 16-bit inputs and does a horizontal add of adjacent pai...
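
The loop shape and the pairwise multiply-add described in this snippet can be sketched in C++ as follows. This is only an illustration under stated assumptions: it is not the code behind the godbolt link, the names `dot_product`, `pmaddwd_model`, and `dot_product_pairwise` are hypothetical, and `pmaddwd_model` is a scalar model of the 128-bit form of the instruction (8 inputs per operand, 4 results) rather than anything taken from LLVM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reduction: sign-extend each 16-bit input to 32 bits, multiply,
// and accumulate into a 32-bit sum. (Hypothetical reconstruction of the
// kind of loop the post describes.)
std::int32_t dot_product(const std::int16_t *a, const std::int16_t *b,
                         std::size_t n) {
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    return sum;
}

// Scalar model of a pairwise multiply-add over 8 x i16 inputs:
// out[k] = a[2k]*b[2k] + a[2k+1]*b[2k+1], giving 4 x i32 results.
void pmaddwd_model(const std::int16_t *a, const std::int16_t *b,
                   std::int32_t *out) {
    for (int lane = 0; lane < 4; ++lane) {
        const std::int32_t even =
            static_cast<std::int32_t>(a[2 * lane]) * b[2 * lane];
        const std::int32_t odd =
            static_cast<std::int32_t>(a[2 * lane + 1]) * b[2 * lane + 1];
        out[lane] = even + odd;
    }
}

// The same reduction expressed with the pairwise model: each step consumes
// 8 inputs but only needs a 4-lane 32-bit accumulator.
std::int32_t dot_product_pairwise(const std::int16_t *a, const std::int16_t *b,
                                  std::size_t n) {
    assert(n % 8 == 0);  // assumption: no remainder loop in this sketch
    std::int32_t acc[4] = {0, 0, 0, 0};
    for (std::size_t i = 0; i < n; i += 8) {
        std::int32_t partial[4];
        pmaddwd_model(a + i, b + i, partial);
        for (int lane = 0; lane < 4; ++lane) {
            acc[lane] += partial[lane];
        }
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Because integer addition is associative, summing adjacent products first and reducing the lanes at the end gives the same total as the plain loop, provided no intermediate sum overflows 32 bits.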

Masked intrinsics and non-default address spaces

2016 Feb 15

...aded intrinsics, the only generic type is the type of the value being loaded/stored. The signature of the intrinsic is generated based on this type. The type of the pointer argument is generated as a pointer to the return type with default addrspace. E.g.: declare <8 x i32> @llvm.masked.load.v8i32(<8 x i32>*, i32, <8 x i1>, <8 x i32>) The problem occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries t...

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 23

...his code https://godbolt.org/g/tTyxpf is a dot product reduction loop
> multiplying sign extended 16-bit values to produce a 32-bit accumulated
> result. The x86 backend is currently not able to optimize it as well as gcc
> and icc. The IR we are getting from the loop vectorizer has several v8i32
> adds and muls inside the loop. These are fed by v8i16 loads and sexts from
> v8i16 to v8i32. The x86 backend recognizes that these are addition
> reductions of multiplication so we use the vpmaddwd instruction which
> calculates 32-bit products from 16-bit inputs and does a horizontal...

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 23

...olt.org/g/tTyxpf is a dot product reduction
>> loop multiplying sign extended 16-bit values to produce a 32-bit
>> accumulated result. The x86 backend is currently not able to optimize
>> it as well as gcc and icc. The IR we are getting from the loop
>> vectorizer has several v8i32 adds and muls inside the loop. These are
>> fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend
>> recognizes that these are addition reductions of multiplication so we
>> use the vpmaddwd instruction which calculates 32-bit products from
>> 16-bit inputs and d...

Matching ConstantFPSDNode tablegen

2018 Jun 07

...ssues. So LLVM doesn't seem to accept a floating point constant literal match like:

  %v = call <4 x float> @foo(i32 15, float %s, float 0.0, <8 x i32> %rsrc, <4 x i32> %samp, i1 0, i32 0, i32 0)
  ret <4 x float> %v

  def : XXXPat<(v4f32 (int_foo i32:$mask, f32:$s, 0, v8i32:$rsrc, v4i32:$sampler, i1:$unorm, 0, i32:$cachepolicy)),
               (FOO_MI (COPY_TO_REGCLASS ?:$s, 32RegClass), ?:$rsrc, ?:$sampler, (as_i32imm ?:$mask), (as_i1imm ?:$unorm), (as_i1imm ?:$cachepolicy), (as_i1imm ?:$cachepolicy), 0, 0, 0, { 0 })>;

which would be ideal. This seems to be because OPC_CheckIn...

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 24

...ps://godbolt.org/g/tTyxpf is a dot product reduction loop
>> multiplying sign extended 16-bit values to produce a 32-bit accumulated
>> result. The x86 backend is currently not able to optimize it as well as gcc
>> and icc. The IR we are getting from the loop vectorizer has several v8i32
>> adds and muls inside the loop. These are fed by v8i16 loads and sexts from
>> v8i16 to v8i32. The x86 backend recognizes that these are addition
>> reductions of multiplication so we use the vpmaddwd instruction which
>> calculates 32-bit products from 16-bit inputs and d...

MCRegisterClass mandatory vs preferred alignment?

2015 Aug 31

...egen. From Target.td:
>
> class RegisterClass<string namespace, list<ValueType> regTypes, int alignment,
> dag regList, RegAltNameIndex idx = NoRegAltName>
>
> X86RegisterInfo.td:
>
> def VR256 : RegisterClass<"X86", [v32i8, v16i16, v8i32, v4i64, v8f32, v4f64],
> 256, (sequence "YMM%u", 0, 15)>;
> def VR256X : RegisterClass<"X86", [v32i8, v16i16, v8i32, v4i64, v8f32, v4f64],
> 256, (sequence "YMM%u", 0, 31)>;
>
> Seems to be 2...

[LLVMdev] SplitVecRes with SIGN_EXTEND_INREG unsupported

2009 Dec 10

I have code that is generating sign extend in reg on a v8i32, but the backend does not support this data type. This then asserts in LegalizeVectorTypes.cpp:389 because there is no function to split this vector into smaller sizes. Would a correct solution be to add this case so to trigger the SplitVecRes_BinaryOp function? This asserts on both my backend...

MCRegisterClass mandatory vs preferred alignment?

2015 Aug 31

Looking around today, it appears that TargetRegisterClass and MCRegisterClass only include a single alignment. This is documented as being the minimum legal alignment, but it appears to often be greater than this in practice. For instance, on x86 the alignment of %ymm0 is listed as 32, not 1. Does anyone know why this is? Additionally, where are these alignments actually defined? I

X86 TRUNCATE cost for AVX & AVX2 mode

2016 Apr 11

...ost for this operation looks very high. I am wondering why such a high cost is kept for this; any pointers to help understand it would be welcome. In a few cases this restricts better vectorization opportunities. Other observations: the cost for TRUNCATE v16i32 to v16i8 in SSE2ConversionTbl is 7. The cost for TRUNCATE v8i32 to v8i8 is 2 in AVX2 and 4 in AVX mode. Thanks, Ashutosh

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 24

... is a dot product reduction loop
> multiplying sign extended 16-bit values to produce a 32-bit
> accumulated result. The x86 backend is currently not able to
> optimize it as well as gcc and icc. The IR we are getting from
> the loop vectorizer has several v8i32 adds and muls inside the
> loop. These are fed by v8i16 loads and sexts from v8i16 to
> v8i32. The x86 backend recognizes that these are addition
> reductions of multiplication so we use the vpmaddwd
> instruction which calculates 32-bit products from 16-...

KNL Vectorization with larger vector width

2018 Jul 24

...4> emission. But I cannot see the vector mix as in default KNL: if iterations=15 we see 1 <8xi32> and the rest scalar. Here, when I keep iterations=2047, I get all scalar. Why is that so? Similarly, in Polly I can't see vector mixes like what happens for KNL, where it emits <v16i32>, <v8i32>, <v4i32>... So here it should emit recursively like <v2048i32> <v1024i32> <v512i32>.....<v32i32>. How do I do this? What am I missing here? What further changes do I need to make? Please help... On Tue, Jul 24, 2018 at 1:52 AM, Friedman, Eli <efriedma at cod...

[LLVMdev] SplitVecRes with SIGN_EXTEND_INREG unsupported

2009 Dec 10

On Wed, Dec 9, 2009 at 8:40 PM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
> I have code that is generating sign extend in reg on a v8i32, but the
> backend does not support this data type. This then asserts in
> LegalizeVectorTypes.cpp:389 because there is no function to split this
> vector into smaller sizes. Would a correct solution be to add this case so
> to trigger the SplitVecRes_BinaryOp function?

SIGN_EXTEND_IN...

[LLVMdev] AVX spill alignment

2011 Aug 25

Hey guys, Are spills/reloads of AVX registers using aligned stores/loads? I can't seem to find the code that aligns the stack slots to 32-bytes. Could someone point me in the right direction? Thanks, Cameron

[LLVMdev] AVX spill alignment

2011 Sep 01

...f AVX registers using aligned stores/loads?

Yes.

> I can't
> seem to find the code that aligns the stack slots to 32-bytes. Could
> someone point me in the right direction?

The register class has 256-bit spill alignment:

def VR256 : RegisterClass<"X86", [v32i8, v16i16, v8i32, v4i64, v8f32, v4f64],
                          256, (sequence "YMM%u", 0, 15)> {
  let SubRegClasses = [(FR32 sub_ss), (FR64 sub_sd), (VR128 sub_xmm)];
}

/jakob

Fwd: [PATCH] D17497: Support arbitrary address space for intrinsics

2016 Feb 24

My gut feeling is that it’s not worth it. When we move from typed to untyped pointers, we’re going to change the mangling from something like p200i8 to just p200, which is already quite a bit cleaner, and actually looks cleaner to me than the version proposed in this patch.

David

> On 24 Feb 2016, at 17:28, Philip Reames via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> This

Vector trunc code generation difference between llvm-3.9 and 4.0

2017 Feb 17

...i16_t __attribute__((ext_vector_type(8)))
>
> v8i16_t foo (v8i16 a, int n)
> {
> return result = a >> n;
> }
>
> With llvm-3.9, the generated sequence does a trunc followed by splat, but
> with llvm-4.0 it is reversed to a splat to a bigger vector followed by a
> v8i32->v8i16 trunc. Is this by design? The earlier code sequence is
> definitely better for our target, but are there known scenarios where the
> new sequence would lead to better code?
>
> Here are the instruction sequences generated in the two cases:
>
> With llvm 3.9:
>
> de...

Question about TableGen RegisterClass definition

2019 Jul 18

Hi All, I have a question about the TableGen RegisterClass definition. I need to map MVTs of different sizes into a register class as below.

def TestReg : RegisterClass<"Test", [v8i32, v4i32], ...>

When I look at TableGen and CodeGen, it looks like the types are used as follows:

1. MCRegisterClass's RegSize and Alignment
2. SpillSize in TableGen
3. Type constraint for instruction pattern matching

In my opinion, it seems it is possible to do it... but I am not 100% s...

KNL Vectorization with larger vector width

2018 Jul 24

...ot see the vector mix like in default knl if iterations=15 we
>> see 1<8xi32> and rest scalar. so here when i keep iteration=2047 i get all
>> scalar why is that so? similarly in polly as well i cant see vector mixes
>> like its happening for KNL it emits <v16i32>, <v8i32>,<v4i32>...so here it
>> should emit recursively like <v2048i32> <v1024i32> <v512i32>.....<v32i32>
>>
>> how to do this?
>>
>> What am i missing here?
>> what further changes do i need to make?
>>
>> Please help...
>...

[RFC] Semi-Automatic clang-format of files with low frequency

2020 Jun 30

I 100% get that we might not like the decisions clang-format is making, but how does one overcome this when adding new code? The pre-merge checks enforce clang-formatting before commit and that's a common review comment anyway for those who didn't join the pre-merge checking group. I'm just wondering are we not all following the same guidelines? Concerns of clang-format not being good
