thr3ads.net - search: "vst1"

Displaying 17 results from an estimated 17 matches for "vst1".

Did you mean: lst1

[LLVMdev] Unaligned vector memory access for ARM/NEON.

2012 Sep 07

[LLVMdev] Unaligned vector memory access for ARM/NEON.

...same issue with generating vector loads/stores for >> vectors with less than word alignment. It seems we took a similar >> approach to solving the problem by modifying the logic in > allowsUnalignedMemoryAccesses. >> >> As you and Jim mentioned, it looks like the vld1/vst1 instructions >> should support element aligned access for any armv7 implementation >> (I'm looking at Table A3-1 ARM Architecture Reference Manual - ARM DDI > 0406C). >> >> Right now I do not think we have the correct code setup in >> ARMSubtarget to accurat...

[LLVMdev] Unaligned vector memory access for ARM/NEON.

2012 Sep 07

[LLVMdev] Unaligned vector memory access for ARM/NEON.

...ating vector loads/stores for > >> vectors with less than word alignment. It seems we took a similar > >> approach to solving the problem by modifying the logic in > > allowsUnalignedMemoryAccesses. > >> > >> As you and Jim mentioned, it looks like the vld1/vst1 instructions > >> should support element aligned access for any armv7 implementation > >> (I'm looking at Table A3-1 ARM Architecture Reference Manual - ARM > >> DDI > > 0406C). > >> > >> Right now I do not think we have the correct code setup...

[LLVMdev] Unaligned vector memory access for ARM/NEON.

2012 Sep 06

[LLVMdev] Unaligned vector memory access for ARM/NEON.

...> We ran into the same issue with generating vector loads/stores for > vectors with less than word alignment. It seems we took a similar > approach to solving the problem by modifying the logic in allowsUnalignedMemoryAccesses. > > As you and Jim mentioned, it looks like the vld1/vst1 instructions > should support element aligned access for any armv7 implementation > (I'm looking at Table A3-1 ARM Architecture Reference Manual - ARM DDI 0406C). > > Right now I do not think we have the correct code setup in > ARMSubtarget to accurately represent this table....

[LLVMdev] Unaligned vector memory access for ARM/NEON.

2012 Sep 06

[LLVMdev] Unaligned vector memory access for ARM/NEON.

...; > We ran into the same issue with generating vector loads/stores for vectors > with less than word alignment. It seems we took a similar approach to > solving the problem by modifying the logic in allowsUnalignedMemoryAccesses. > > As you and Jim mentioned, it looks like the vld1/vst1 instructions should > support element aligned access for any armv7 implementation (I'm looking at > Table A3-1 ARM Architecture Reference Manual - ARM DDI 0406C). > > Right now I do not think we have the correct code setup in ARMSubtarget to > accurately represent this table. I...

[RFC] Half-Precision Support in the Arm Backends

2017 Dec 06

[RFC] Half-Precision Support in the Arm Backends

Thanks a lot for the suggestions! I will look into using vld1/vst1, sounds good. I am custom lowering the bitcasts, that's now the only place where FP_TO_FP16 and FP16_TO_FP nodes are created to avoid inefficient code generation. I will double check if I can't achieve the same without using these nodes (because I really would like to get completely rid...

[LLVMdev] Unaligned vector memory access for ARM/NEON.

2012 Sep 06

[LLVMdev] Unaligned vector memory access for ARM/NEON.

Hi Pete, We ran into the same issue with generating vector loads/stores for vectors with less than word alignment. It seems we took a similar approach to solving the problem by modifying the logic in allowsUnalignedMemoryAccesses. As you and Jim mentioned, it looks like the vld1/vst1 instructions should support element aligned access for any armv7 implementation (I'm looking at Table A3-1 ARM Architecture Reference Manual - ARM DDI 0406C). Right now I do not think we have the correct code setup in ARMSubtarget to accurately represent this table. I would propose that we kee...

[RFC] Half-Precision Support in the Arm Backends

2017 Dec 04

[RFC] Half-Precision Support in the Arm Backends

Hi, I am working on C/C++ language support for the Armv8.2-A half-precision instructions. I've added support for _Float16 as a new source language type to Clang. _Float16 is a C11 extension type for which arithmetic is well defined, as opposed to e.g. __fp16 which is a storage-only type. I then fixed up the AArch64 backend, which was mostly straightforward: this involved making operations

[LLVMdev] Unaligned vector memory access for ARM/NEON.

2012 Sep 06

[LLVMdev] Unaligned vector memory access for ARM/NEON.

...rmv7-none-linux-gnueabi -mfpu=neon -mllvm -vectorize, the intermediate LLVM assembly >> looks OK (and it has an align 2 vector load), but the generated ARM assembly has the scalar loads. >> When I compile with (4.6) gcc -std=c99 -ftree-vectorize -marm -mfpu=neon -O3, it uses vld1.16 and vst1.32 regardless of the parameter alignment. This is on armv7a. >> >> The gcc version (and the clang version with our modified backend) runs fine, even on 2-byte aligned data. Is this not a guarantee across armv7/armv7a generally? >> >> Pete >> >> >> >&gt...

[RFC] Half-Precision Support in the Arm Backends

2018 Jan 18

[RFC] Half-Precision Support in the Arm Backends

...llvm.org> on behalf of Sjoerd Meijer via llvm-dev <llvm-dev at lists.llvm.org> Sent: 06 December 2017 08:32 To: Friedman, Eli; llvm-dev at lists.llvm.org Subject: Re: [llvm-dev] [RFC] Half-Precision Support in the Arm Backends Thanks a lot for the suggestions! I will look into using vld1/vst1, sounds good. I am custom lowering the bitcasts, that's now the only place where FP_TO_FP16 and FP16_TO_FP nodes are created to avoid inefficient code generation. I will double check if I can't achieve the same without using these nodes (because I really would like to get completely rid...

[RFC] Half-Precision Support in the Arm Backends

2018 Jan 18

[RFC] Half-Precision Support in the Arm Backends

[LLVMdev] Unaligned vector memory access for ARM/NEON.

2012 Sep 10

[LLVMdev] Unaligned vector memory access for ARM/NEON.

On Sep 7, 2012, at 4:46 PM, David Peixotto <dpeixott at codeaurora.org> wrote: >> Note that this won't work on big-endian targets where changing the >> vld1/vst1 element size actually changes the behavior of the instructions. > > Ok, so would it be best to put an explicit test for endianess in the code? > The td files already contain restrictions some for endianess for selecting > unaligned loads/stores. Probably the safest thing would be an ex...

[LLVMdev] [RFC] Stripping unusable intrinsics

2014 Dec 23

[LLVMdev] [RFC] Stripping unusable intrinsics

...ntrinsic::memmove: { ... case Intrinsic::arm_neon_vld1: { assert(ArgIdx == 0 && "Invalid argument index"); assert(Loc.Ptr == II->getArgOperand(ArgIdx) && "Intrinsic location pointer not argument?"); // LLVM's vld1 and vst1 intrinsics currently only support a single // vector register. if (DL) Loc.Size = DL->getTypeStoreSize(II->getType()); break; } } Just change this to: switch (II->getIntrinsicID()) { case Intrinsic::memset: case Intrinsic::memcpy: case Intrins...

[LLVMdev] NEON intrinsics preventing redundant load optimization?

2014 Dec 10

[LLVMdev] NEON intrinsics preventing redundant load optimization?

...haps this is due to NEON supporting interleaved loads for loading arrays of structs into vector registers of each element. I suspect that isn’t very common across architectures, so it wouldn’t surprise me if there was no IR instruction for the interleaved cases (vld[234].*). It seems the vld1.* and vst1.* do have those direct IR representations though. It’s great news if this is fixed in the current tip, but in the short term (for app store builds using the official toolchain) are there any LLVM-specific extensions to initialise a float32x4_t that will get lowered to the "load <4 x float&...

[LLVMdev] NEON intrinsics preventing redundant load optimization?

2014 Dec 08

[LLVMdev] NEON intrinsics preventing redundant load optimization?

...e around it. > > Creating a bug for this is probably the best thing to do, since this > is a common pattern that needs looking into to produce optimal code. Thanks for the responses. I’ve filed bug #21778 for this: http://llvm.org/bugs/show_bug.cgi?id=21778 I’ve also tried replacing the vst1.32 with setting the data[i] elements individually with vgetq_lane, which gets at least the single multiply case back to optimal code. There’s still an unneeded temporary when doing res = a * b * c though. Anyway, let’s continue this on the bug tracker :) Simon

[LLVMdev] [RFC] Stripping unusable intrinsics

2014 Dec 23

[LLVMdev] [RFC] Stripping unusable intrinsics

On Dec 23, 2014, at 10:28 AM, Chris Bieneman <beanz at apple.com> wrote: >>> It should be straight-forward to have something like LLVMInitializeX86Target/RegisterTargetMachine install the intrinsics into a registry. >> >> I tried doing that a few years ago. It’s not nearly as easy as it sounds because we’ve got hardcoded references to various target intrinsics scattered

[LLVMdev] [global-isel] Proposal for a global instruction selector

2013 Aug 08

[LLVMdev] [global-isel] Proposal for a global instruction selector

...ould be legal: All 8-bit types via ldrb/strb to GPR. (i8, v1i8, v2i4, v4i2, v8i1) All 16-bit types via ldrh/strh to GPR. (i16, f16, v1i16, v2i8, ...) All 32-bit types via ldr/str to GPR and vldr/vstr to SPR. All 64-bit types via ldrd/strd to GPRPair and vldr/vstr to DPR. All 128-bit types via vld1/vst1 to DPair. All 192-bit types via vld1/vst1 to DTriple. All 256-bit types via vld1/vst1 to DQuad. This larger set of legal types also makes it easier to handle things like extractelement <8 x i8> which currently produces an illegal type and is thus obfuscated by the type legalizer. An i8 live...

[RFC] - Deduplication of debug information in linkers (LLD)

2017 Dec 04

[RFC] - Deduplication of debug information in linkers (LLD)

...e > generation > > to stack loads/stores which I don't want. > > 2) Custom lowering f16 stores is very similar, and creates truncating > > half-word integer stores. > > Technically, there are no f16 load/store instructions, yes, but we can > use NEON vdl1 and vst1 to get something roughly equivalent, right? > > You probably want to custom-lower BITCAST instructions; the generic > sequence emitted by the legalizer is pretty inefficient in most cases. > > --- > > Overall, I think your approach makes sense. > > -Eli > > -- >...

search for: vst1