
Displaying 20 results from an estimated 1000 matches similar to: "Rather poor code optimisation of current clang/LLVM targeting Intel x86 (both -64 and -32)"

2018 Nov 27
2
Rather poor code optimisation of current clang/LLVM targeting Intel x86 (both -64 and -32)
"Sanjay Patel" <spatel at rotateright.com> wrote: > IIUC, you want to use x86-specific bit-hacks (sbb masking) in cases like > this: > unsigned int foo(unsigned int crc) { > if (crc & 0x80000000) > crc <<= 1, crc ^= 0xEDB88320; > else > crc <<= 1; > return crc; > } To document this for x86 too: rewrite the function
2018 Nov 28
2
Rather poor code optimisation of current clang/LLVM targeting Intel x86 (both -64 and -32)
On Wed, Nov 28, 2018 at 7:11 AM Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Thanks for reporting this and other perf opportunities. As I mentioned
> before, if you could file bug reports for these, that's probably the only
> way they're ever going to get fixed (unless you're planning to fix them
> yourself). It's not an ideal situation, but
2015 Aug 31
2
[RFC] New pass: LoopExitValues
Hello LLVM, This is a proposal for a new pass that improves performance and code size in some nested loop situations. The pass is target independent. From the description in the file header: This optimization finds loop exit values reevaluated after the loop execution and replaces them by the corresponding exit values if they are available. Such sequences can arise after the
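A hypothetical C example of the pattern such a pass targets (my own illustration, not code from the RFC): after the loop, an address the loop already computed is re-evaluated from scratch.

  /* Without the pass, 'dst + n' after the loop may be recomputed even though
     the pointer induction 'd' already holds that exact exit value. */
  int *copy_row(int *dst, const int *src, int n) {
      int *d = dst;
      const int *s = src;
      for (int i = 0; i < n; ++i)
          *d++ = *s++;
      return dst + n;   /* candidate for replacement by the exit value of 'd' */
  }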
2015 Sep 01
2
[RFC] New pass: LoopExitValues
On Mon, Aug 31, 2015 at 5:52 PM, Jake VanAdrighem <jvanadrighem at gmail.com> wrote:
> Do you have some specific performance measurements?
Averaging 4 runs of 10000 iterations each of Coremark on my X86_64 desktop showed:
-O2 performance: +2.9% faster with the L.E.V. pass
-Os size: 1.5% smaller with the L.E.V. pass
In the case of Coremark, the benefit comes mainly from the matrix
2014 Sep 02
3
[LLVMdev] LICM promoting memory to scalar
All, If we can speculatively execute a load instruction, why isn’t it safe to hoist it out by promoting it to a scalar in the LICM pass? There is a comment in the LICM pass that if a load/store is conditional then it is not safe, because it would break the LLVM concurrency model (see commit 73bfa4a). There is an IR test checking this in test/Transforms/LICM/scalar-promote-memmodel.ll However, I have
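A hypothetical illustration of the concern (not the test from scalar-promote-memmodel.ll): promoting a conditionally executed store to a scalar introduces a store on paths that never wrote memory in the original program, which another thread could observe.

  /* Original: *p is only read and written on iterations where the flag is set. */
  void update(int *p, const int *flags, int n) {
      for (int i = 0; i < n; ++i)
          if (flags[i])
              *p += 1;
  }
  /* An invalid promotion would load *p once before the loop, update a register
     inside it, and store back unconditionally after the loop. If no flag was
     set, that is a store the source never performed, and it can race with
     another thread that legitimately assumes *p is untouched. */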
2014 Sep 02
2
[LLVMdev] LICM promoting memory to scalar
I think gcc is right. It inserted a branch for n == 0 (the cbz at the top), so that's not a problem. In all other regards, this is safe: if you examine the sequence of loads and stores, it eliminated all but the first load and all but the last store. How's that unsafe? If I had to guess, the bug here is that LLVM doesn't want to hoist the load over the condition (which it is right
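A sketch of the shape being debated (my own reconstruction, not the code from the thread): once loop entry is guarded by a check on n, a single load before the loop and a single store after it preserve the original memory behaviour.

  /* Naive form: *sum is loaded and stored on every iteration. */
  void accumulate(int *sum, const int *a, int n) {
      for (int i = 0; i < n; ++i)
          *sum += a[i];
  }
  /* When the loop body is guarded (it only runs if n > 0), it is fine to load
     *sum once, accumulate in a register, and store once at the end: every
     execution that reaches the final store also stored in the original code. */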
2014 Sep 03
3
[LLVMdev] LICM promoting memory to scalar
Thanks for the background on the concurrent memory model. So, is it sufficient that the loop entry is guarded by the condition (cbz at the top) to prevent the race? The loop entry will be guarded by the condition if the loop has been rotated by the loop-rotate pass. Since LICM runs after loop rotation, we can use ScalarEvolution::isLoopEntryGuardedByCond to check whether we can speculatively execute the load without
2017 Jul 01
2
KNL Assembly Code for Matrix Multiplication
Thank you. It means vmovdqa64 zmm22, zmmword ptr [rip + .LCPI0_0] # zmm22 = [8,9,10,11,12,13,14,15] loads the 64-bit constant index values 8, 9, 10, 11, 12, 13, 14, 15 into zmm22, not the values stored at those locations, and zmm2 contains the constant 4000. So vpmuludq zmm14, zmm10, zmm2 multiplies the index values by 4000, since the stride for array b is 4000. zmm14=
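For context, a guess at the kind of source loop that produces this addressing (my reconstruction; the actual benchmark in the thread may use different types and dimensions): walking a column of b, so consecutive k values sit one 4000-byte row apart and the vectorizer multiplies the gathered indexes by the row stride.

  #define N 1000   /* assumed: 1000 floats per row gives a 4000-byte row stride */

  float dot_column(float a[N][N], float b[N][N], int i, int j) {
      float sum = 0.0f;
      for (int k = 0; k < N; ++k)
          sum += a[i][k] * b[k][j];   /* b[k][j]: addresses step by N*sizeof(float) = 4000 bytes */
      return sum;
  }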
2016 Feb 11
3
Expected constant simplification not happening
Hi, the appended IR code does not optimize to my liking :) This is the interesting part in x86_64 that got produced via clang -Os:
---
movq -16(%r12), %rax
movl -4(%rax), %ecx
andl $2298949, %ecx ## imm = 0x231445
cmpq $2298949, (%rax,%rcx) ## imm = 0x231445
leaq 8(%rax,%rcx), %rax
cmovneq %r15, %rax
movl $2298949, %esi ## imm = 0x231445
movq %r12, %rdi
movq %r14,
2019 Mar 04
2
Where's the optimiser gone (part 11): use the proper instruction for sign extension
Compile with -O3 -m32 (see <https://godbolt.org/z/yCpBpM>):
long lsign(long x) { return (x > 0) - (x < 0); }
long long llsign(long long x) { return (x > 0) - (x < 0); }
While the code generated for the "long" version of this function is quite OK, the code for the "long long" version misses an obvious optimisation:
lsign: # @lsign
mov
2006 Jul 09
2
[LLVMdev] Critical edges
Dear guys, I am having problems splitting edges correctly, mostly because the new basic blocks are creating infinite loops. Could someone help me fix the code below? It is creating assembly like the one below. Block LBB1_9 was inserted to break the critical edge between blocks LBB1_3 and LBB1_8. But it changes the semantics of the original program because, before, LBB1_8 was falling
2016 Dec 07
1
Expected constant simplification not happening
Hello, has there been any progress on this topic? The 3.9 optimizer output is still the same, as I just checked. https://llvm.org/bugs/show_bug.cgi?id=24448 Ciao Nat! Sanjay Patel wrote:
> [cc'ing Zia]
>
> We have this transform with -Os for some cases after:
> http://reviews.llvm.org/rL244601
> http://reviews.llvm.org/D11363
>
> but something in this example is
2016 May 13
4
RFC: callee saved register verifier
Hi, This is a proposal to introduce a mechanism to LLVM that uses runtime instrumentation to verify that a call target preserves all of the registers it says it does. Internally, we have a diverse set of calling conventions for runtime functions that get called from LLVM compiled code, and having some sort of validation mechanism will give us more confidence in selecting aggressive calling
2013 Feb 20
3
[LLVMdev] Is va_arg correct on Mips backend?
I don't have a Mips board. I compiled with the commands and checked the asm output as below.
1. Question: The distance between caller arg[4] and arg[5] is 4 bytes, but the callee gets every arg[] at an 8-byte offset (arg_ptr1+8 or arg_ptr2+8). I assume #BB#4 and #BB#5 are the arg_ptr, i.e. the pointer used to access the stack arguments.
2. Question: Stack memory 28($sp) has no initial value. If
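For reference, a minimal variadic function of the kind such a test exercises (my own sketch, not the ch8_3.cpp from the thread); the point at issue is how many bytes each va_arg step advances on this target:

  #include <stdarg.h>

  /* Sum 'count' int arguments passed through the ellipsis. */
  int sum_va(int count, ...) {
      va_list ap;
      va_start(ap, count);
      int total = 0;
      for (int i = 0; i < count; ++i)
          total += va_arg(ap, int);   /* must advance by the same slot size the caller used */
      va_end(ap);
      return total;
  }

A call such as sum_va(6, 1, 2, 3, 4, 5, 6) should return 21, which happens to match the expected result quoted later in this listing, though the actual test program may differ.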
2013 Feb 20
0
[LLVMdev] Is va_arg correct on Mips backend?
Does it make a difference if you give the "-target" option to clang?
$ clang -target mips-linux-gnu ch8_3.cpp -o ch8_3.bc -emit-llvm -c
The .s file generated this way looks quite different from the one in your email. On Tue, Feb 19, 2013 at 5:06 PM, Jonathan <gamma_chen at yahoo.com.tw> wrote:
> I don't have a Mips board. I compiled with the commands and checked the asm
>
2014 Oct 24
3
[LLVMdev] IndVar widening in IndVarSimplify causing performance regression on GPU programs
Hi, I noticed a significant performance regression (up to 40%) on some internal CUDA benchmarks (a reduced example is presented below). The root cause of this regression seems to be that IndVarSimplify widens induction variables assuming arithmetic on wider integer types is as cheap as on narrower ones. However, this assumption is wrong, at least for the NVPTX64 target. Although the NVPTX64 target
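A hypothetical reduced shape of the problem (my illustration, not the internal benchmark): a 32-bit loop counter feeding 64-bit address arithmetic gets widened to 64 bits, which removes an extension but leaves the target doing more expensive 64-bit arithmetic inside the loop.

  /* 'i' is 32-bit; IndVarSimplify may widen it to i64 so the extension feeding
     the address computation (base + (i64)i * 4) disappears. On NVPTX64 the
     widened 64-bit adds and compares can cost more than the extension they replace. */
  void scale(float *out, const float *in, int n, float k) {
      for (int i = 0; i < n; ++i)
          out[i] = in[i] * k;
  }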
2019 Jun 30
6
[hexagon][PowerPC] code regression (sub-optimal code) on LLVM 9 when generating hardware loops, and the "llvm.uadd" intrinsic.
Hi All, The following code:
void hexagon2( int *a, int *res )
{
  int i = 100;
  while ( i-- ) {
    *res++ = *a++;
  }
}
gets compiled as a sub-optimal software loop on LLVM 9.0 instead of a hardware loop, whereas it was compiled as a hardware loop in LLVM 7.0. This is the final assembly code generated by LLVM 9.0:
.text
.file "main.c"
.globl hexagon2 // --
2010 Jun 11
4
[LLVMdev] Bignum development
Hi all, After searching for a decent compiler backend for ages (google sometimes isn't helpful), I recently stumbled upon LLVM. Woot!! I work on bignum arithmetic (I'm a professional mathematician) and have recently decided to switch from developing GPL'd bignum code to BSD licensed code. (See http://www.mpir.org/ which I contributed to for a while - a fork of GMP). Please bear with
2013 Feb 19
0
[LLVMdev] Is va_arg correct on Mips backend?
Which part of the generated code do you think is not correct? Could you be more specific? I compiled this program with clang and ran it on a Mips board. It returns the expected result (21). On Tue, Feb 19, 2013 at 4:15 AM, Jonathan <gamma_chen at yahoo.com.tw> wrote:
> I checked the Mips backend output for the following C code fragment.
> It seems incorrect. Is it my
2006 Jul 05
0
[LLVMdev] Critical edges
> If you don't want critical edges in the machine code CFG, you're going to
> have to write a machine code CFG critical edge splitting pass: LLVM
> doesn't currently have one.
>
> -Chris
Hey guys, I've coded a pass to break the critical edges of the machine control flow graph. The program works fine, but I am sure it is not the right way of implementing it.