Displaying 20 results from an estimated 800 matches similar to: "[LLVMdev] Autovectorization questions"

2014 Mar 12
4
[LLVMdev] Autovectorization questions
In order to vectorize code like this LLVM needs to prove that “A[i*7]” does not wrap in the address space. It fails to do so and so LLVM doesn’t vectorize this loop even if we try to force it. The following loop will be vectorized if we force it: int foo(int * A, int * B, int n, int k) { for (int i = 0; i < 1024; ++i) A[i] += B[i*k]; } So will this loop: int foo(int * restrict A, int
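Written out as a compilable sketch, the two situations described above look like this (the strided-store loop is reconstructed from the "A[i*7]" mentioned in the post rather than quoted from it, and the pragma is the usual source-level way to "force" vectorization):

/* Case 1: constant-strided store. The vectorizer has to prove that i*7 cannot
 * wrap in the address space; when it cannot, the loop stays scalar even with
 * the pragma asking for vectorization. */
void strided_store(int *A, int n)
{
#pragma clang loop vectorize(enable)
    for (int i = 0; i < n; ++i)
        A[i * 7] += 1;
}

/* Case 2: unit-stride store with a strided load -- the form the post says is
 * vectorized once vectorization is forced. */
void unit_stride_store(int *A, int *B, int k)
{
#pragma clang loop vectorize(enable)
    for (int i = 0; i < 1024; ++i)
        A[i] += B[i * k];
}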
2012 Jul 26
0
[LLVMdev] X86 FMA4
Ah, bad example. This is a general problem for all (maybe most) SSE and AVX SS/SD patterns though, which is why I mentioned Sandy Bridge. You can swap out VFMADDSD in my example for VADDSD or whatever you like. I have the lion's share of such a change implemented already, and performance is greatly affected. If the community is interested in this change, I would be happy to prepare a patch.
2012 Jul 27
2
[LLVMdev] X86 FMA4
Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps/etc. for loading/storing from memory. vmovaps as a load takes 1 load µop, latency 3, with a reciprocal throughput of 0.5. vmovaps as a store takes 1 store µop plus 1 load µop for address calculation, latency 3, with a reciprocal throughput of 1. He does not list vmovsd, but movsd has the same stats as vmovaps, so I feel it is a
2012 Jul 27
0
[LLVMdev] X86 FMA4
Hey Michael, Thanks for the legwork! It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. As I am sure you are aware, we cannot use SSE (movaps)
2015 Oct 02
2
Register Spill Caused by the Reassociation pass
This conflict exists with many optimizations, including copy propagation, coalescing, hoisting, etc. Each could increase register pressure, with similar impact. Attempts to control register pressure locally (within an optimization pass) tend to become hard to tune and maintain. Would it be better to describe, e.g. in metadata, how to undo an optimization? Optimizations that attempt to reduce pressure like
2012 Jul 25
6
[LLVMdev] X86 FMA4
We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns. Why is VFMADDSD4 defined with vector types? Is this simply because the gcc intrinsic uses vector types? It's quite unnatural if you have a compiler that generates FMAs as opposed to requiring user intrinsics. -Dave
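For readers who have not seen the intrinsic shapes being discussed: the GCC FMA4 scalar builtins really are declared on 128-bit vector types, which is exactly what makes them awkward for a compiler that forms FMAs from ordinary scalar arithmetic. A small illustrative sketch (using the public wrappers from x86intrin.h, not the LLVM VFMADDSD4 pattern itself; compile with -mfma4):

#include <math.h>        /* fma(): the plain scalar fused multiply-add      */
#include <x86intrin.h>   /* _mm_macc_sd(): GCC/Clang wrapper for FMA4 SD    */

/* What a compiler that generates FMAs wants to express: pure scalars. */
double scalar_fma(double a, double b, double c)
{
    return fma(a, b, c);            /* double x double x double -> double   */
}

/* What the FMA4 intrinsic actually takes: full 128-bit vectors, even though
 * only element 0 is meaningful for the SD ("scalar double") form. */
__m128d intrinsic_fma(__m128d a, __m128d b, __m128d c)
{
    return _mm_macc_sd(a, b, c);    /* vector-typed, mirroring VFMADDSD4    */
}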
2012 Jul 26
1
[LLVMdev] X86 FMA4
Hey Jan and Dave, It's not obvious, but there is a significant scalar performance issue following the GCC intrinsics. Let's look at the VFMADDSD pattern. We're operating on scalars with undefineds as the remaining vector elements of the operands. This sounds okay, but when one looks closer... vmovsd fp4_+1088(%rip), %xmm3 # fpppp.f:647 vmovaps %xmm3, 18560(%rsp)
2011 Mar 18
2
[LLVMdev] LLVM ERROR: No such instruction: `vmovsd ...' ?
Hello, I am running a i7 MacBook Pro 2011. If I write: @g = global double 0.000000e+00 define i32 @main() { entry: %0 = load double* @g %1 = fmul double 1.000000e+06, %0 store double %1, double* @g ret i32 0 } in test.ll and I run > llc test.ll > gcc test.s I get: test.s:12:no such instruction: `vmovsd _g(%rip), %xmm0' test.s:13:no such instruction: `vmulsd LCPI0_0(%rip),
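For readability, here is the module from the snippet laid out again (the IR is unchanged; it uses the old "load double* @g" spelling of that era):

@g = global double 0.000000e+00

define i32 @main() {
entry:
  %0 = load double* @g
  %1 = fmul double 1.000000e+06, %0
  store double %1, double* @g
  ret i32 0
}

The error itself usually means the assembler that gcc invokes does not understand AVX mnemonics even though llc chose to emit them for the host CPU. If that is the cause here (an assumption; the excerpt does not confirm it), running llc -mattr=-avx test.ll keeps the output to SSE instructions such as movsd/mulsd, which an older system assembler accepts.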
2012 Jul 27
3
[LLVMdev] X86 FMA4
> It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. You are misunderstanding [no worries, happens to everyone = )]. The timings I listed were for
2012 Mar 28
2
[LLVMdev] Suboptimal code due to excessive spilling
Hi, I have run into the following strange behavior and wanted to ask for some advice. For the C program below, function sum() gets inlined in foo() but the code generated looks very suboptimal (the code is an extract from a larger program). Below I show the 32-bit x86 assembly as produced by the demo page on the llvm home page ("Output A"). As you can see from the assembly, after
2013 Aug 28
3
[PATCH] x86: AVX instruction emulation fixes
- we used the C4/C5 (first prefix) byte instead of the apparent ModR/M one as the second prefix byte; - early decoding normalized vex.reg, thus corrupting it for the main consumer (copy_REX_VEX()), resulting in #UD on the two-operand instructions we emulate. Also add respective test cases to the testing utility, plus: - fix get_fpu() (the fall-through order was inverted); - add cpu_has_avx2,
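For readers without the VEX encoding in their head, here is why both mistakes corrupt the operand: the register specifier vvvv is stored one's-complemented in bits 6:3 of the last prefix byte, so reading the byte after C4/C5 as if it were ModR/M, or normalizing vvvv before copy_REX_VEX() consumes it, clobbers the register number. A minimal extraction sketch (illustrative C, not code from the patch):

#include <stdint.h>

/* Return the vvvv register specifier of a VEX-prefixed instruction.
 * 2-byte form: C5, then one byte laid out as R.vvvv.L.pp (R, vvvv inverted).
 * 3-byte form: C4, then RXB.m-mmmm, then W.vvvv.L.pp (vvvv inverted).
 * In both cases vvvv sits in bits 6:3 of the last prefix byte. */
static inline uint8_t vex_reg(const uint8_t *insn)
{
    uint8_t last = (insn[0] == 0xC5) ? insn[1] : insn[2];
    return (~last >> 3) & 0xF;   /* undo the one's-complement storage */
}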
2014 Mar 12
2
[LLVMdev] Autovectorization questions
On Mar 12, 2014, at 4:05 PM, Chandler Carruth <chandlerc at google.com> wrote: > > On Wed, Mar 12, 2014 at 3:50 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote: > In order to vectorize code like this LLVM needs to prove that “A[i*7]” does not wrap in the address space. It fails to do so > > But, why? > > I'm moderately sure that neither C nor C++
2015 Jul 27
3
[LLVMdev] i1* function argument on x86-64
I am running into a problem with 'i1*' as a function argument, which seems to have appeared since I switched to LLVM 3.6 (but it may have another source, of course). If I look at the assembler that MCJIT generates for an x86-64 target, I see that the 'i1*' array is treated as a sequence of 1-bit-wide elements. (I guess that's correct.) However, I used to call the function
2014 Oct 16
2
[LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
It seems that adding -extra-vectorizer-passes doesn't help the vectorizer in my case. Re-running LoopRotation does nothing. 2014-10-15 2:54 GMT+04:00 Chandler Carruth <chandlerc at google.com>: > > On Tue, Oct 14, 2014 at 3:50 PM, Hal Finkel <hfinkel at anl.gov> wrote: >> >> > I have and will continue to push >> > back on trying to add it until we at least
2012 Apr 05
0
[LLVMdev] Suboptimal code due to excessive spilling
I don't know much about this, but maybe -mllvm -unroll-count=1 can be used as a workaround? /Patrik Hägglund -----Original Message----- From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Brent Walker Sent: 28 March 2012 03:18 To: llvmdev Subject: [LLVMdev] Suboptimal code due to excessive spilling Hi, I have run into the following strange behavior
2014 Oct 14
3
[LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
I've added a straw-man of some extra optimization passes that help specific benchmarks here or there by either preparing code better on the way into the vectorizer or cleaning up afterward. These are off by default until there is some consensus on the right path forward, but this way we can all test out the same set of flags, and collaborate on any tweaks to them. The primary principle here
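For anyone wanting to try the straw-man locally, the extra passes are gated behind the -extra-vectorizer-passes flag named in this thread; a minimal way to compare with and without it (assuming a build where the flag is wired into the pass pipeline, and with test.c standing in for whatever benchmark source you care about) might be:

# Baseline -O3 pipeline
clang -O3 -S -emit-llvm test.c -o baseline.ll

# Same pipeline with the extra preparation/cleanup passes around the vectorizers
clang -O3 -mllvm -extra-vectorizer-passes -S -emit-llvm test.c -o extra.ll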
2014 Apr 07
9
[LLVMdev] 3.4.1 Release Plans
Hi Robert, Can you ping the code owners about these patches. It might be good to write a separate email per code owner and cc the appropriate -commits list. Thanks, Tom On Wed, Apr 02, 2014 at 06:16:44PM +0400, Robert Khasanov wrote: > Hi Tom, > > I would like to nominate the following patches to be backported to 3.4.1 > > Clang: > 1. r204742 - Zinovy Nis <zinovy.nis at
2016 Jun 29
2
avx512 JIT backend generates wrong code on <4 x float>
Hi! When compiling the attached module with the JIT engine on an Intel KNL, I see wrong code getting emitted. I attach a complete reproducer program which shows the bug in LLVM 3.8. It loads and JIT-compiles the module and prints the assembler. I stumbled on this since the result of an actual calculation was wrong. So it's not only the text version of the assembler but also the machine
2012 Jul 26
0
[LLVMdev] X86 FMA4
Because the intrinsics use vector types (same as gcc). - Jan ----- Original Message ----- > From: "dag at cray.com" <dag at cray.com> > To: llvmdev at cs.uiuc.edu > Cc: > Sent: Wednesday, July 25, 2012 3:26 PM > Subject: [LLVMdev] X86 FMA4 > > We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns. > > Why is VFMADDSD4
2018 Sep 21
2
X32 bugs around "cannot select" lingering around
Hi, There are several, to my eyes, somewhat related-looking bugs: Bug 36743 - Cannot select: X86ISD::CALL ICE with -mx32 -O2 -fno-plt https://bugs.llvm.org/show_bug.cgi?id=36743 Bug 34268 - JITting of x32 code on x64 fails with crash or instruction selection error. https://bugs.llvm.org/show_bug.cgi?id=34268 There's unfortunately been no investigation. I'm asking because I hit