similar to: [LLVMdev] unaligned AVX store gets split into two instructions

Displaying 20 results from an estimated 1000 matches similar to: "[LLVMdev] unaligned AVX store gets split into two instructions"

2013 Jul 10
3
[LLVMdev] unaligned AVX store gets split into two instructions
Hi, Yes. On Sandybridge 256-bit loads/stores are double pumped. This means that they go in one after the other in two cycles. On Haswell the memory ports are wide enough to allow a 256bit memory operation in one cycle. So, on Sandybridge we split unaligned memory operations into two 128bit parts to allow them to execute in two separate ports. This is also what GCC and ICC do. It is very
2013 Jul 10
2
[LLVMdev] unaligned AVX store gets split into two instructions
I've narrowed this down to a single kernel (kernel.ll), which does a fixed-size matrix-matrix multiply: # ~/llvm-32-final/bin/llc kernel.ll -o kernel32.s # ~/llvm-33-final/bin/llc kernel.ll -o kernel33.s # ~/llvm-32-final/bin/clang++ harness.cpp kernel32.s -o harness32 # ~/llvm-32-final/bin/clang++ harness.cpp kernel33.s -o harness33 # time ./harness32 real 0m0.584s user 0m0.581s sys 0m0.001s
2013 Dec 11
2
[LLVMdev] AVX code gen
Hello - I found this post on the llvm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang / llvm are capable of generating AVX with packed instructions as well as utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2 for example) but I can not get clang/llvm to generate such
2013 Jul 10
0
[LLVMdev] unaligned AVX store gets split into two instructions
On Tue, Jul 9, 2013 at 9:01 PM, Zach Devito <zdevito at gmail.com> wrote: > I'm seeing a difference in how LLVM 3.3 and 3.2 emit unaligned vector loads > on AVX. > 3.3 is splitting up an unaligned vector load but in 3.2, it was emitted as a > single instruction (details below). > In a matrix-matrix inner-kernel, I see a ~25% decrease in performance, which > seems to be
2013 Dec 12
0
[LLVMdev] AVX code gen
It probably does not pick the right processor architecture. You could try “clang -mavx” or “clang -march=corei7-avx” for ivy-bridge and “clang -march=core-avx2” or “clang -mavx2" for haswell. $ clang -march=core-avx2 -O3 -S -o - test.c .section __TEXT,__text,regular,pure_instructions .globl _f .align 4, 0x90 _f: ## @f
2013 Sep 19
0
[LLVMdev] unaligned AVX store gets split into two instructions
Nadav, We see multiple regressions after r172868 in ISPC compiler (based on LLVM optimizer). The regressions are due to spill/reloads, which are due to increase register pressure. This matches Zach's analysis. We've filed bug 17285 for this problem. Is there any possibility to avoid splitting in case of multiple loads going together? Dmitry. On Wed, Jul 10, 2013 at 1:12 PM, Zach
2013 Jul 10
0
[LLVMdev] unaligned AVX store gets split into two instructions
Thanks for all the the info! I'm still in the process of narrowing down the performance difference in my code. I'm no longer convinced its related to only the unaligned loads/stores alone since extracting this part of the kernel makes the performance difference disappear. I will try to narrow down what is going on and if it seems related LLVM, I will post an example. Thanks again, Zach
2020 Sep 01
2
Vector evolution?
Hi, Please consider the following loop: using v4f32 = float __attribute__((__vector_size__(16))); void fct6(v4f32 *x) { #pragma clang loop vectorize(enable) for (int i = 0; i < 256; ++i) x[i] = 7 * x[i]; } After compiling it with: clang++ -O3 -march=native -mtune=native \ -Rpass=loop-vectorize,slp-vectorize -Rpass-missed=loop-vectorize,slp-vectorize
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
Illustrative Example: clang -fveclib=SVML -O3 svml.c -mavx #include <math.h> void foo(double *a, int N){ int i; #pragma clang loop vectorize_width(8) for (i=0;i<N;i++){ a[i] = sin(i); } } Currently, this results in a call to <8 x double> __svml_sin8(<8 x double>) after the vectorizer. This is 8-element SVML sin() called with 8-element argument. On the surface,
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
Ashutosh, Thanks for the repy. Related earlier topic on this appears in the review of the SVML patch (@mmasten). Adding few names from there. https://reviews.llvm.org/D19544 There, I see Hal's review comment "let's start only with the directly-legal calls". Apparently, what we have right now in the trunk is "not legal enough". I'll work on the patch to stop
2018 Jul 02
2
[RFC][VECLIB] how should we legalize VECLIB calls?
Adding to Ashutosh's comments, We are also interested in making LLVM generate vector math library calls that are available with glibc (version > 2.22). reference: https://sourceware.org/glibc/wiki/libmvec Using the example case given in the reference, we found there are 2 vector versions for "sin" (4 X double) with same VF namely _ZGVcN4v_sin (avx) version and _ZGVdN4v_sin
2019 Sep 02
3
AVX2 codegen - question reg. FMA generation
Hello, On the appended reasonably simple test case that has an fmul/fadd sequence on <8 x float> vector types, I don't see the x86-64 code generator (with cpu set to haswell or later types) turning it into an AVX2 FMA instructions. Here's the snippet in the output it generates: $ llc -O3 -mcpu=skylake --------------------- .LBB0_2: # =>This Inner
2018 Jul 02
8
[RFC][VECLIB] how should we legalize VECLIB calls?
On 07/02/2018 04:33 PM, Saito, Hideki wrote: > >   > > >It may not be a full solution for the problems you're trying to solve > >   > > If we are inventing a new solution, I’d like it also to solve OpenMP > declare simd legalization issue. If a small extension of existing scheme > > works for mathlib only, I’m happy to take that and discuss OpenMP >
2018 Jul 02
2
[RFC][VECLIB] how should we legalize VECLIB calls?
It may not be a full solution for the problems you're trying to solve, but I don't know why adding to include/llvm/CodeGen/RuntimeLibcalls.def is a problem in itself. Certainly, it's a mess that could be organized, especially so we're not repeating everything for each data type as we do right now. So yes, I think that would allow us to remove the VecLib mappings because we are
2012 Jul 29
2
[LLVMdev] rotate
in C or C++, how can I get clang/llvm to try and do a "rotate". (want to test this code in the mips16 port) i.e. emit rotr node. tia. reed
2012 Jul 29
3
[LLVMdev] rotate
Nice! Clever compiler.. On 07/28/2012 08:55 PM, Michael Gottesman wrote: > I can get clang/llvm to emit a rotate instruction on x86-64 when compiling C by just using -Os and the rotate from Hacker's Delight i.e., > > ====== > #include<stdlib.h> > #include<stdint.h> > > uint32_t ror(uint32_t input, size_t rot_bits) > { > return (input>>
2015 Nov 04
2
Vectorizing structure reads, writes, etc on X86-64 AVX
Hi Jay - I see the slow, small accesses using an older clang [Apple LLVM version 7.0.0 (clang-700.1.76)], but this looks fixed on trunk. I made a change that comes into play if you don't specify a particular CPU: http://llvm.org/viewvc/llvm-project?view=revision&revision=245950 $ ./clang -O1 -mavx copy.c -S -o - ... movslq %edi, %rax movq _spr_dynamic at GOTPCREL(%rip),
2010 Sep 01
5
[LLVMdev] equivalent IR, different asm
The attached .ll files seem equivalent, but the resulting asm from 'opt-fail.ll' causes a crash to webkit. I suspect the usage of registers is wrong, can someone take a look ? $ llc opt-pass.ll -o - .section __TEXT,__text,regular,pure_instructions .globl __ZN7WebCore6kolos1ERiS0_PKNS_20RenderBoxModelObjectEPNS_10StyleImageE .align 4, 0x90
2010 Sep 01
0
[LLVMdev] equivalent IR, different asm
On Sep 1, 2010, at 6:25 AM, Argyrios Kyrtzidis wrote: > The attached .ll files seem equivalent, but the resulting asm from 'opt-fail.ll' causes a crash to webkit. > I suspect the usage of registers is wrong, can someone take a look ? The difference is that there is a shift right after the multiply, before the divide. In IR, the difference is: %5 = mul nsw i32 %4, %tmp1
2015 May 04
2
[LLVMdev] Incorrect code generated for arm64
Hi all, I’ve narrowed down a problem in my code to the following test case: - - - - typedef struct {float v[2];} vec2; typedef struct {float v[3];} vec3; vec2 getVec2(); vec3 getVec3() { vec2 myVec = getVec2(); vec3 res; res.v[0] = myVec.v[0]; res.v[1] = myVec.v[1]; res.v[2] = 1; return res; } - - - - Compiling this with any level of optimization for arm64 gives incorrect code,