Displaying 20 results from an estimated 31 matches for "v8i32".
2018 Jul 23
3
[LoopVectorizer] Improving the performance of dot product reduction loop
Hello all,
This code https://godbolt.org/g/tTyxpf is a dot product reduction loop
multiplying sign-extended 16-bit values to produce a 32-bit accumulated
result. The x86 backend is currently not able to optimize it as well as gcc
and icc. The IR we are getting from the loop vectorizer has several v8i32
adds and muls inside the loop. These are fed by v8i16 loads and sexts from
v8i16 to v8i32. The x86 backend recognizes that these are addition
reductions of multiplication so we use the vpmaddwd instruction which
calculates 32-bit products from 16-bit inputs and does a horizontal add of
adjacent pai...
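For readers without the linked source at hand, the loop shape described above can be sketched in C as follows. This is an illustrative reconstruction, not the code behind the godbolt link: the names `dot_product` and `pmaddwd_lane` are hypothetical, and `pmaddwd_lane` is only a scalar model of what one 32-bit output lane of vpmaddwd computes (two 16-bit products summed).

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar form of the reduction: sign-extend the 16-bit inputs,
 * multiply, and accumulate into a 32-bit sum.
 * Assumption: the total does not overflow int32_t. */
int32_t dot_product(const int16_t *a, const int16_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}

/* Scalar model of one 32-bit lane of vpmaddwd (hypothetical helper):
 * two sign-extended 16x16 products, added horizontally.
 * The addition is done in unsigned arithmetic so that the single
 * overflowing input combination wraps instead of being undefined. */
int32_t pmaddwd_lane(int16_t a0, int16_t b0, int16_t a1, int16_t b1) {
    uint32_t p0 = (uint32_t)((int32_t)a0 * (int32_t)b0);
    uint32_t p1 = (uint32_t)((int32_t)a1 * (int32_t)b1);
    return (int32_t)(p0 + p1);
}
```

Summing `pmaddwd_lane` over adjacent element pairs gives the same total as `dot_product` for an even `n`, which is why the instruction fits this reduction.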
2016 Feb 15
5
Masked intrinsics and non-default address spaces
...aded intrinsics, the only generic type is the type of the value being loaded/stored. The signature of the intrinsic is generated based on this type. The type of the pointer argument is generated as a pointer to the return type with default addrspace. E.g.:
declare <8 x i32> @llvm.masked.load.v8i32(<8 x i32>*, i32, <8 x i1>, <8 x i32>)
The problem occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries t...
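As a rough scalar model of what the `@llvm.masked.load.v8i32` declaration above computes, the following C sketch may help. The function name `masked_load_v8i32` and the `LANES` macro are hypothetical, the alignment argument is omitted, and address spaces (the actual subject of the thread) are not modeled at all.

```c
#include <stdbool.h>
#include <stdint.h>

#define LANES 8

/* Per-lane semantics of a masked load: enabled lanes read memory,
 * disabled lanes take the pass-through value and do not touch memory. */
void masked_load_v8i32(const int32_t *ptr, const bool mask[LANES],
                       const int32_t passthru[LANES], int32_t out[LANES]) {
    for (int i = 0; i < LANES; ++i) {
        out[i] = mask[i] ? ptr[i] : passthru[i];
    }
}
```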
2018 Jul 23
4
[LoopVectorizer] Improving the performance of dot product reduction loop
...his code https://godbolt.org/g/tTyxpf is a dot product reduction loop
> multiplying sign-extended 16-bit values to produce a 32-bit accumulated
> result. The x86 backend is currently not able to optimize it as well as gcc
> and icc. The IR we are getting from the loop vectorizer has several v8i32
> adds and muls inside the loop. These are fed by v8i16 loads and sexts from
> v8i16 to v8i32. The x86 backend recognizes that these are addition
> reductions of multiplication so we use the vpmaddwd instruction which
> calculates 32-bit products from 16-bit inputs and does a horizontal...
2018 Jul 23
2
[LoopVectorizer] Improving the performance of dot product reduction loop
...olt.org/g/tTyxpf is a dot product reduction
>> loop multiplying sign-extended 16-bit values to produce a 32-bit
>> accumulated result. The x86 backend is currently not able to optimize
>> it as well as gcc and icc. The IR we are getting from the loop
>> vectorizer has several v8i32 adds and muls inside the loop. These are
>> fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend
>> recognizes that these are addition reductions of multiplication so we
>> use the vpmaddwd instruction which calculates 32-bit products from
>> 16-bit inputs and d...
2018 Jun 07
2
Matching ConstantFPSDNode tablegen
...ssues.
So LLVM doesn't seem to accept a floating point constant literal match like:
%v = call <4 x float> @foo(i32 15, float %s, float 0.0, <8 x i32> %rsrc, <4 x i32> %samp, i1 0, i32 0, i32 0)
ret <4 x float> %v
def : XXXPat<(v4f32 (int_foo i32:$mask, f32:$s, 0, v8i32:$rsrc, v4i32:$sampler, i1:$unorm, 0, i32:$cachepolicy)),
  (FOO_MI (COPY_TO_REGCLASS ?:$s, 32RegClass), ?:$rsrc, ?:$sampler, (as_i32imm ?:$mask), (as_i1imm ?:$unorm), (as_i1imm ?:$cachepolicy), (as_i1imm ?:$cachepolicy), 0, 0, 0, { 0 })>;
which would be ideal. This seems to be because OPC_CheckIn...
2018 Jul 24
4
[LoopVectorizer] Improving the performance of dot product reduction loop
...ps://godbolt.org/g/tTyxpf is a dot product reduction loop
>> multiplying sign-extended 16-bit values to produce a 32-bit accumulated
>> result. The x86 backend is currently not able to optimize it as well as gcc
>> and icc. The IR we are getting from the loop vectorizer has several v8i32
>> adds and muls inside the loop. These are fed by v8i16 loads and sexts from
>> v8i16 to v8i32. The x86 backend recognizes that these are addition
>> reductions of multiplication so we use the vpmaddwd instruction which
>> calculates 32-bit products from 16-bit inputs and d...
2015 Aug 31
2
MCRegisterClass mandatory vs preferred alignment?
...egen. From Target.td:
>
> class RegisterClass<string namespace, list<ValueType> regTypes, int alignment,
> dag regList, RegAltNameIndex idx = NoRegAltName>
>
> X86RegisterInfo.td:
>
> def VR256 : RegisterClass<"X86", [v32i8, v16i16, v8i32, v4i64, v8f32, v4f64],
> 256, (sequence "YMM%u", 0, 15)>;
> def VR256X : RegisterClass<"X86", [v32i8, v16i16, v8i32, v4i64, v8f32, v4f64],
> 256, (sequence "YMM%u", 0, 31)>;
>
> Seems to be 2...
2009 Dec 10
2
[LLVMdev] SplitVecRes with SIGN_EXTEND_INREG unsupported
I have code that is generating sign extend in reg on a v8i32, but the
backend does not support this data type. This then asserts in
LegalizeVectorTypes.cpp:389 because there is no function to split this
vector into smaller sizes. Would a correct solution be to add this case
so to trigger the SplitVecRes_BinaryOp function?
This asserts on both my backend...
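To make the question concrete, here is a scalar C model of a per-lane sign-extend-in-register and of splitting an 8-lane operation into two 4-lane halves. It is a sketch of the semantics only, not of LLVM's legalizer code; all function names are hypothetical. Because the operation is lane-wise, processing the two halves independently gives the same result as processing all eight lanes at once.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of x (1 <= bits <= 32) to 32 bits.
 * Uses the mask-and-xor trick in unsigned arithmetic to avoid
 * shifting negative values. */
int32_t sign_extend_inreg(int32_t x, unsigned bits) {
    uint32_t sign = 1u << (bits - 1);
    uint32_t low = (uint32_t)x & ((sign << 1) - 1u);
    return (int32_t)((low ^ sign) - sign);
}

/* 4-lane version, standing in for a vector type the target supports. */
void sext_inreg_v4i32(const int32_t in[4], unsigned bits, int32_t out[4]) {
    for (int i = 0; i < 4; ++i) {
        out[i] = sign_extend_inreg(in[i], bits);
    }
}

/* 8-lane version expressed as two independent 4-lane halves,
 * which is the effect a vector split would need to have. */
void sext_inreg_v8i32_split(const int32_t in[8], unsigned bits, int32_t out[8]) {
    sext_inreg_v4i32(in, bits, out);          /* low half: lanes 0..3 */
    sext_inreg_v4i32(in + 4, bits, out + 4);  /* high half: lanes 4..7 */
}
```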
2015 Aug 31
3
MCRegisterClass mandatory vs preferred alignment?
Looking around today, it appears that TargetRegisterClass and
MCRegisterClass each include only a single alignment. This is documented as
being the minimum legal alignment, but it appears to often be greater
than this in practice. For instance, on x86 the alignment of %ymm0 is
listed as 32, not 1. Does anyone know why this is?
Additionally, where are these alignments actually defined? I
2016 Apr 11
2
X86 TRUNCATE cost for AVX & AVX2 mode
...ost for this operation looks very high.
I am wondering why such a high cost is kept for this; any pointers that help explain it would be appreciated.
In a few cases this restricts better vectorization opportunities.
Other observations:
The cost for TRUNCATE v16i32 to v16i8 in SSE2ConversionTbl is 7.
The cost for TRUNCATE v8i32 to v8i8 is 2 in AVX2 mode and 4 in AVX mode.
Thanks,
Ashutosh
2018 Jul 24
2
[LoopVectorizer] Improving the performance of dot product reduction loop
... is a dot product reduction loop
> multiplying sign-extended 16-bit values to produce a 32-bit
> accumulated result. The x86 backend is currently not able to
> optimize it as well as gcc and icc. The IR we are getting from
> the loop vectorizer has several v8i32 adds and muls inside the
> loop. These are fed by v8i16 loads and sexts from v8i16 to
> v8i32. The x86 backend recognizes that these are addition
> reductions of multiplication so we use the vpmaddwd
> instruction which calculates 32-bit products from 16-...
2018 Jul 24
2
KNL Vectorization with larger vector width
...4> emission.
But I cannot see the vector mix. With default KNL, if iterations=15 we see
one <8 x i32> and the rest scalar. Here, when I keep iterations=2047, I get all
scalar. Why is that so? Similarly, in Polly I can't see vector mixes
like what happens for KNL, which emits <v16i32>, <v8i32>, <v4i32>... so here it
should emit recursively, like <v2048i32> <v1024i32> <v512i32> ... <v32i32>.
How do I do this?
What am I missing here?
What further changes do I need to make?
Please help...
On Tue, Jul 24, 2018 at 1:52 AM, Friedman, Eli <efriedma at cod...
2009 Dec 10
0
[LLVMdev] SplitVecRes with SIGN_EXTEND_INREG unsupported
On Wed, Dec 9, 2009 at 8:40 PM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
> I have code that is generating sign extend in reg on a v8i32, but the
> backend does not support this data type. This then asserts in
> LegalizeVectorTypes.cpp:389 because there is no function to split this
> vector into smaller sizes. Would a correct solution be to add this case so
> to trigger the SplitVecRes_BinaryOp function?
SIGN_EXTEND_IN...
2011 Aug 25
2
[LLVMdev] AVX spill alignment
Hey guys,
Are spills/reloads of AVX registers using aligned stores/loads? I can't
seem to find the code that aligns the stack slots to 32-bytes. Could
someone point me in the right direction?
Thanks,
Cameron
2011 Sep 01
0
[LLVMdev] AVX spill alignment
...f AVX registers using aligned stores/loads?
Yes.
> I can't
> seem to find the code that aligns the stack slots to 32-bytes. Could
> someone point me in the right direction?
The register class has 256-bit spill alignment:
def VR256 : RegisterClass<"X86", [v32i8, v16i16, v8i32, v4i64, v8f32, v4f64],
256, (sequence "YMM%u", 0, 15)> {
let SubRegClasses = [(FR32 sub_ss), (FR64 sub_sd), (VR128 sub_xmm)];
}
/jakob
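The 256 in the definition above is given in bits, so it corresponds to the 32-byte stack-slot alignment asked about. A small C sketch of that arithmetic and of an alignment check follows; `bits_to_bytes` and `is_aligned` are hypothetical helpers for illustration and do not come from LLVM.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert an alignment given in bits to bytes. */
size_t bits_to_bytes(size_t bits) {
    return bits / 8;
}

/* Return nonzero if p is aligned to align_bytes. */
int is_aligned(const void *p, size_t align_bytes) {
    return ((uintptr_t)p % align_bytes) == 0;
}
```

For example, a slot declared with `_Alignas(32)` passes `is_aligned(slot, bits_to_bytes(256))`, which is the property an aligned 256-bit spill or reload relies on.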
2016 Feb 24
0
Fwd: [PATCH] D17497: Support arbitrary address space for intrinsics
My gut feeling is that it’s not worth it. When we move from typed to untyped pointers, we’re going to change the mangling from something like p200i8 to just p200, which is already quite a bit cleaner, and actually looks cleaner to me than the version proposed in this patch.
David
> On 24 Feb 2016, at 17:28, Philip Reames via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> This
2017 Feb 17
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...i16_t __attribute__((ext_vector_type(8)))
>
> v8i16_t foo(v8i16_t a, int n)
> {
>   return a >> n;
> }
>
> With llvm-3.9, the generated sequence does a trunc followed by splat, but
> with llvm-4.0 it is reversed to a splat to a bigger vector followed by a
> v8i32->v8i16 trunc. Is this by design? The earlier code sequence is
> definitely better for our target, but are there known scenarios where the
> new sequence would lead to better code?
>
> Here are the instruction sequences generated in the two cases:
>
> With llvm 3.9:
>
> de...
2019 Jul 18
2
Question about TableGen RegisterClass definition
Hi All,
I have a question about TableGen RegisterClass definition.
I need to map MVTs of different sizes into a register class, as below.
def TestReg : RegisterClass<"Test", [v8i32, v4i32], ...>
When I look at TableGen and CodeGen, it looks like the types are used as follows:
1. MCRegisterClass's RegSize and Alignment
2. SpillSize in TableGen
3. Type constraint for instruction pattern matching
In my opinion, it seems it is possible to do it... but I am not 100% s...
2018 Jul 24
2
KNL Vectorization with larger vector width
...ot see the vector mix. With default KNL, if iterations=15 we
>> see one <8 x i32> and the rest scalar. Here, when I keep iterations=2047, I get all
>> scalar. Why is that so? Similarly, in Polly I can't see vector mixes
>> like what happens for KNL, which emits <v16i32>, <v8i32>, <v4i32>... so here it
>> should emit recursively, like <v2048i32> <v1024i32> <v512i32> ... <v32i32>.
>>
>> How do I do this?
>>
>> What am I missing here?
>> What further changes do I need to make?
>>
>> Please help...
>...
2020 Jun 30
5
[RFC] Semi-Automatic clang-format of files with low frequency
I 100% get that we might not like the decisions clang-format is making, but
how does one overcome this when adding new code? The pre-merge checks
enforce clang-formatting before commit and that's a common review comment
anyway for those who didn't join the pre-merge checking group. I'm just
wondering are we not all following the same guidelines?
Concerns of clang-format not being good