Displaying 20 results from an estimated 800 matches similar to: "[LLVMdev] Autovectorization questions"
2014 Mar 12
4
[LLVMdev] Autovectorization questions
In order to vectorize code like this, LLVM needs to prove that “A[i*7]” does not wrap in the address space. It fails to do so, and therefore LLVM doesn’t vectorize this loop even if we try to force it.
The following loop will be vectorized if we force it:
int foo(int * A, int * B, int n, int k) {
  for (int i = 0; i < 1024; ++i)
    A[i] += B[i*k];
}
So will this loop:
int foo(int * restrict A, int
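As a minimal sketch of the access pattern under discussion (my reconstruction, not code from the original post, which is truncated above), a strided store like A[i*7] is exactly the case where the vectorizer must first prove that the index arithmetic cannot wrap in the address space:
// Hypothetical example: without a no-wrap proof for i*7, LLVM stays
// conservative and leaves this loop scalar even when vectorization is forced.
void strided(int *A, int *B) {
  for (int i = 0; i < 1024; ++i)
    A[i * 7] += B[i];
}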
2012 Jul 26
0
[LLVMdev] X86 FMA4
Ah, bad example. This is a general problem for all (maybe most) SSE and AVX
SS/SD patterns though, which is why I mentioned Sandybridge. You can swap
out VFMADDSD in my example for VADDSD or whatever you like.
I have the lion's share of such a change implemented already, and performance
is greatly affected. If the community is interested in this change, I would
be happy to prepare a patch.
2012 Jul 27
2
[LLVMdev] X86 FMA4
Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps etc. for loading/storing from memory.
vmovaps load: 1 load µop, a latency of 3, and a reciprocal throughput of 0.5.
vmovaps store: 1 store µop plus 1 load µop for address calculation, a latency of 3, and a reciprocal throughput of 1.
He does not list vmovsd, but movsd has the same stats as vmovaps, so I feel it is a
2012 Jul 27
0
[LLVMdev] X86 FMA4
Hey Michael,
Thanks for the legwork!
It appears that the stats you listed are for movaps [SSE], not vmovaps
[AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256),
since they are both AVX instructions. Although, yes, I agree that this is
not clear from Agner's report. Please correct me if I am misunderstanding.
As I am sure you are aware, we cannot use SSE (movaps)
2015 Oct 02
2
Register Spill Caused by the Reassociation pass
This conflict arises with many optimizations, including copy propagation, coalescing, hoisting, etc. Each could increase register pressure, with similar impact. Attempts to control register pressure locally (within an optimization pass) tend to become hard to tune and maintain. Would a better way be to describe, e.g. in metadata, how to undo an optimization? Optimizations that attempt to reduce pressure like
2012 Jul 25
6
[LLVMdev] X86 FMA4
We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns.
Why is VFMADDSD4 defined with vector types? Is this simply because the
gcc intrinsic uses vector types? It's quite unnatural if you have a
compiler that generates FMAs as opposed to requiring user intrinsics.
-Dave
2012 Jul 26
1
[LLVMdev] X86 FMA4
Hey Jan and Dave,
It's not obvious, but there is a significant scalar performance issue
that comes from following the GCC intrinsics.
Let's look at the VFMADDSD pattern. We're operating on scalars with
undefineds as the remaining vector elements of the operands. This sounds
okay, but when one looks closer...
vmovsd fp4_+1088(%rip), %xmm3 # fpppp.f:647
vmovaps %xmm3, 18560(%rsp)
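For context, the scalar code the thread cares about is ordinary floating-point arithmetic that the compiler itself contracts into an FMA, with no vector types or user intrinsics in sight; a minimal illustration (my example, not taken from the thread):
#include <math.h>
// A single scalar FMA the backend would ideally select as one vfmaddsd,
// rather than routing the operands through vector-typed patterns padded
// with undef elements.
double axpy(double a, double x, double y) {
  return fma(a, x, y);   // or plain a * x + y when FP contraction is enabled
}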
2011 Mar 18
2
[LLVMdev] LLVM ERROR: No such instruction: `vmovsd ...' ?
Hello,
I am running an i7 MacBook Pro (2011). If I write:
@g = global double 0.000000e+00
define i32 @main() {
entry:
%0 = load double* @g
%1 = fmul double 1.000000e+06, %0
store double %1, double* @g
ret i32 0
}
in test.ll and I run
> llc test.ll
> gcc test.s
I get:
test.s:12:no such instruction: `vmovsd _g(%rip), %xmm0'
test.s:13:no such instruction: `vmulsd LCPI0_0(%rip),
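A likely explanation (an assumption on my part, since the message is truncated here) is that llc has selected AVX instructions for this CPU while the system assembler invoked by gcc predates AVX and does not recognize the v-prefixed mnemonics. One workaround is to disable AVX code generation before assembling:
> llc -mattr=-avx test.ll
> gcc test.s
Alternatively, assemble with a toolchain whose assembler understands AVX, such as a newer binutils or clang's integrated assembler.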
2012 Jul 27
3
[LLVMdev] X86 FMA4
> It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding.
You are misunderstanding [no worries, happens to everyone = )]. The timings I listed were for
2012 Mar 28
2
[LLVMdev] Suboptimal code due to excessive spilling
Hi,
I have run into the following strange behavior and wanted to ask for
some advice. For the C program below, function sum() gets inlined in
foo() but the code generated looks very suboptimal (the code is an
extract from a larger program).
Below I show the 32-bit x86 assembly as produced by the demo page on
the llvm home page ("Output A"). As you can see from the assembly,
after
2013 Aug 28
3
[PATCH] x86: AVX instruction emulation fixes
- we used the C4/C5 (first prefix) byte instead of the apparent ModR/M
one as the second prefix byte
- early decoding normalized vex.reg, thus corrupting it for the main
consumer (copy_REX_VEX()), resulting in #UD on the two-operand
instructions we emulate
Also add respective test cases to the testing utility plus
- fix get_fpu() (the fall-through order was inverted)
- add cpu_has_avx2,
2014 Mar 12
2
[LLVMdev] Autovectorization questions
On Mar 12, 2014, at 4:05 PM, Chandler Carruth <chandlerc at google.com> wrote:
>
> On Wed, Mar 12, 2014 at 3:50 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
> In order to vectorize code like this LLVM needs to prove that “A[i*7]” does not wrap in the address space. It fails to do so
>
> But, why?
>
> I'm moderately sure that neither C nor C++
2015 Jul 27
3
[LLVMdev] i1* function argument on x86-64
I am running into a problem with 'i1*' as a function argument which
seems to have appeared since I switched to LLVM 3.6 (but it may have
another cause, of course). If I look at the assembler that the MCJIT
generates for an x86-64 target, I see that the array 'i1*' is taken as a
sequence of 1-bit-wide elements. (I guess that's correct.) However, I
used to call the function
2014 Oct 16
2
[LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
It seems that adding -extra-vectorizer-passes doesn't help the vectorizer
in my case. Re-running LoopRotation does nothing.
2014-10-15 2:54 GMT+04:00 Chandler Carruth <chandlerc at google.com>:
>
> On Tue, Oct 14, 2014 at 3:50 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>> > I have and will continue to push
>> > back on trying to add it until we at least
2012 Apr 05
0
[LLVMdev] Suboptimal code due to excessive spilling
I don't know much about this, but maybe -mllvm -unroll-count=1 can be used as a workaround?
/Patrik Hägglund
-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Brent Walker
Sent: 28 March 2012 03:18
To: llvmdev
Subject: [LLVMdev] Suboptimal code due to excessive spilling
Hi,
I have run into the following strange behavior
2014 Oct 14
3
[LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
I've added a straw-man of some extra optimization passes that help specific
benchmarks here or there by either preparing code better on the way into
the vectorizer or cleaning up afterward. These are off by default until
there is some consensus on the right path forward, but this way we can all
test out the same set of flags, and collaborate on any tweaks to them.
The primary principle here
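As a usage sketch (assuming the straw-man set is gated behind the hidden LLVM option named in this thread's subject), the extra passes could be exercised like so:
> opt -O2 -extra-vectorizer-passes input.ll -S -o output.ll
> clang -O2 -mllvm -extra-vectorizer-passes input.c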
2014 Apr 07
9
[LLVMdev] 3.4.1 Release Plans
Hi Robert,
Can you ping the code owners about these patches? It might be good
to write a separate email per code owner and cc the appropriate -commits
list.
Thanks,
Tom
On Wed, Apr 02, 2014 at 06:16:44PM +0400, Robert Khasanov wrote:
> Hi Tom,
>
> I would like to nominate the following patches to be backported to 3.4.1
>
> Clang:
> 1. r204742 - Zinovy Nis <zinovy.nis at
2016 Jun 29
2
avx512 JIT backend generates wrong code on <4 x float>
Hi!
When compiling the attached module with the JIT engine on an Intel KNL, I
see wrong code getting emitted. I attach a complete exploit program
which shows the bug in LLVM 3.8. It loads and JIT-compiles the module
and prints the assembler. I stumbled on this since the result of an
actual calculation was wrong. So it's not only the text version of the
assembler but also the machine
2012 Jul 26
0
[LLVMdev] X86 FMA4
Because the intrinsics use vector types (same as gcc).
- Jan
----- Original Message -----
> From: "dag at cray.com" <dag at cray.com>
> To: llvmdev at cs.uiuc.edu
> Cc:
> Sent: Wednesday, July 25, 2012 3:26 PM
> Subject: [LLVMdev] X86 FMA4
>
> We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns.
>
> Why is VFMADDSD4
2018 Sep 21
2
X32 bugs around "cannot select" lingering around
Hi,
There are several, to my eyes, somewhat related-looking bugs:
Bug 36743 - Cannot select: X86ISD::CALL ICE with -mx32 -O2 -fno-plt
https://bugs.llvm.org/show_bug.cgi?id=36743
Bug 34268 - JITting of x32 code on x64 fails with crash or instruction selection error.
https://bugs.llvm.org/show_bug.cgi?id=34268
There's unfortunately been no investigation.
I'm asking because I hit