Displaying 20 results from an estimated 1417 matches for "rdx".
2016 Jun 29
2
avx512 JIT backend generates wrong code on <4 x float>
...l,avx512f,f16c,ssse3,mmx,-pku,cmov,-xop,rdseed,movbe,-hle,xsaveopt,-sha,sse2,sse3,-avx512dq,
Assembly:
.text
.file "module_KFxOBX_i4_after.ll"
.globl adjmul
.align 16, 0x90
.type adjmul,@function
adjmul:
.cfi_startproc
leaq (%rdi,%r8), %rdx
addq %rsi, %r8
testb $1, %cl
cmoveq %rdi, %rdx
cmoveq %rsi, %r8
movq %rdx, %rax
sarq $63, %rax
shrq $62, %rax
addq %rdx, %rax
sarq $2, %rax
movq %r8, %rcx
sarq $63, %rcx
shrq $62, %rcx
addq %r8,...
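The sarq $63 / shrq $62 / addq / sarq $2 run above is the usual branch-free rounding for a signed division by four. A minimal C sketch of the same computation, assuming gcc/clang-style arithmetic right shifts on signed values (the function name is ours, not from the module):
~~~
#include <stdint.h>

/* n / 4 with truncation toward zero, no branch: add a bias of 3 when n is
 * negative, then shift.  Mirrors the rax/rdx sequence in the listing above. */
static int64_t sdiv4(int64_t n)
{
    int64_t bias = (int64_t)((uint64_t)(n >> 63) >> 62);  /* 3 if n < 0, else 0 */
    return (n + bias) >> 2;                               /* arithmetic shift */
}
~~~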
2019 Aug 30
1
New lazyload rdx key type: list(eagerKey=, lazyKeys=)
Prior to R-3.6.0 the keys in the lazyload key files, e.g.
pkg/data/Rdata.rdx or pkg/R/pkg.rdx, seemed to all be 2-long integer
vectors. Now they can be lists. The ones I have seen have two components:
"eagerKey", a 2-long integer vector, and "lazyKeys", a named list of
2-long integer vectors.
> rdx <- readRDS(system.file(package="surviva...
2016 Jun 29
0
avx512 JIT backend generates wrong code on <4 x float>
...d,movbe,-hle,xsaveopt,-sha,sse2,sse3,-avx512dq,
> Assembly:
> .text
> .file "module_KFxOBX_i4_after.ll"
> .globl adjmul
> .align 16, 0x90
> .type adjmul,@function
> adjmul:
> .cfi_startproc
> leaq (%rdi,%r8), %rdx
> addq %rsi, %r8
> testb $1, %cl
> cmoveq %rdi, %rdx
> cmoveq %rsi, %r8
> movq %rdx, %rax
> sarq $63, %rax
> shrq $62, %rax
> addq %rdx, %rax
> sarq $2, %rax
> movq %r8, %rcx
> sarq...
2016 Jun 30
1
avx512 JIT backend generates wrong code on <4 x float>
...2dq,
>> Assembly:
>> .text
>> .file "module_KFxOBX_i4_after.ll"
>> .globl adjmul
>> .align 16, 0x90
>> .type adjmul,@function
>> adjmul:
>> .cfi_startproc
>> leaq (%rdi,%r8), %rdx
>> addq %rsi, %r8
>> testb $1, %cl
>> cmoveq %rdi, %rdx
>> cmoveq %rsi, %r8
>> movq %rdx, %rax
>> sarq $63, %rax
>> shrq $62, %rax
>> addq %rdx, %rax
>> sarq $2, %r...
2018 Sep 11
2
Byte-wide stores aren't coalesced if interspersed with other stores
Andres:
FWIW, codegen will do the merge if you turn on global alias analysis for it
"-combiner-global-alias-analysis". That said, we should be able to do this
merging earlier.
-Nirav
On Mon, Sep 10, 2018 at 8:33 PM, Andres Freund via llvm-dev <
llvm-dev@lists.llvm.org> wrote:
> Hi,
>
> On 2018-09-10 13:42:21 -0700, Andres Freund wrote:
> > I have, in postgres,
2016 Jun 23
2
AVX512 instruction generated when JIT compiling for an avx2 architecture
...ed.
Looking at the assembler reveals an AVX512 instruction which shouldn't
be there.
Assembly:
.text
.file "module"
.globl main
.align 16, 0x90
.type main,@function
main:
.cfi_startproc
movq 8(%rsp), %r10
leaq (%rdi,%r8), %rdx
addq %rsi, %r8
testb $1, %cl
cmoveq %rdi, %rdx
cmoveq %rsi, %r8
movq %rdx, %rax
sarq $63, %rax
shrq $62, %rax
addq %rdx, %rax
sarq $2, %rax
movq %r8, %rcx
sarq $63, %rcx
shrq $62, %rcx
addq %r8,...
2013 Aug 20
0
[LLVMdev] Memory optimizations for LLVM JIT
...IT is not as good as that generated by clang or llc.
Here is an example:
--------------------------------------------------------------------
source fragment ==> clang or llc
struct {
uint64_t a[10];
} *p;
mov 0x8(%rax),%rdx
p->a[2] = p->a[1]; mov %rdx,0x10(%rax)
p->a[3] = p->a[1]; ==> mov %rdx,0x18(%rax)
p->a[4] = p->a[2]; mov %rdx,0x20(%rax)
p->a[5] = p->a[4]; mov %rdx,0x28(%rax)
-------------------...
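For reference, the fragment completed into a compilable form (the wrapping function is an assumption); clang and llc load p->a[1] into %rdx once and reuse it for all four stores, which is the code quality the JIT output discussed here falls short of:
~~~
#include <stdint.h>

struct S { uint64_t a[10]; };

/* Hypothetical wrapper around the source fragment shown above. */
void copy_lanes(struct S *p)
{
    p->a[2] = p->a[1];
    p->a[3] = p->a[1];
    p->a[4] = p->a[2];
    p->a[5] = p->a[4];
}
~~~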
2017 Oct 11
1
[PATCH v1 06/27] x86/entry/64: Adapt assembly for PIE support
...entry)
movl %ecx, %eax /* zero extend */
cmpq %rax, RIP+8(%rsp)
je .Lbstep_iret
- cmpq $.Lgs_change, RIP+8(%rsp)
+ leaq .Lgs_change(%rip), %rcx
+ cmpq %rcx, RIP+8(%rsp)
jne .Lerror_entry_done
/*
@@ -1383,10 +1388,10 @@ ENTRY(nmi)
* resume the outer NMI.
*/
- movq $repeat_nmi, %rdx
+ leaq repeat_nmi(%rip), %rdx
cmpq 8(%rsp), %rdx
ja 1f
- movq $end_repeat_nmi, %rdx
+ leaq end_repeat_nmi(%rip), %rdx
cmpq 8(%rsp), %rdx
ja nested_nmi_out
1:
@@ -1440,7 +1445,8 @@ nested_nmi:
pushq %rdx
pushfq
pushq $__KERNEL_CS
- pushq $repeat_nmi
+ leaq repeat_nmi(%rip), %rdx
+ pus...
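The hunks above all apply the same rewrite: an absolute movq $sym, %reg becomes leaq sym(%rip), %reg so the address is computed relative to the instruction pointer. A minimal C sketch of the same idea under -fpie/-fpic (the symbol name is hypothetical, not from the kernel patch):
~~~
/* Position-independent code materializes a local symbol's address as
 *     leaq marker(%rip), %rax
 * rather than the absolute  movq $marker, %rax  form, which would need a
 * relocation the image cannot satisfy once loaded at an arbitrary address. */
static char marker;              /* stands in for a label such as repeat_nmi */

const void *marker_address(void)
{
    return &marker;
}
~~~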
2013 Oct 22
1
[LLVMdev] System call miscompilation using the fast register allocator
...%val = alloca i32, align 4
store i32 1, i32* %val, align 4
%0 = ptrtoint i32* %val to i64
call void asm sideeffect "", "{r8}"(i64 4) nounwind
call void asm sideeffect "", "{r10}"(i64 %0) nounwind
call void asm sideeffect "", "{rdx}"(i64 3) nounwind
call void asm sideeffect "", "{rsi}"(i64 1) nounwind
call void asm sideeffect "", "{rdi}"(i64 -1) nounwind
%1 = call i64 asm sideeffect "", "={rdi}"() nounwind
%2 = call i64 asm sideeffect "", &...
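The empty asm sideeffect calls above pin constants and a pointer into specific registers, the shape of a raw syscall setup. A rough GNU C analogue of that IR (an assumption for illustration, not the original reproducer):
~~~
#include <stdint.h>

void pin_syscall_args(void)
{
    int32_t val = 1;
    register int64_t r8  asm("r8")  = 4;
    register int64_t r10 asm("r10") = (int64_t)&val;
    register int64_t rdx asm("rdx") = 3;
    register int64_t rsi asm("rsi") = 1;
    register int64_t rdi asm("rdi") = -1;
    /* The empty asm forces each value to be live in exactly that register,
     * which is what stresses the fast register allocator in the report. */
    asm volatile("" :: "r"(r8), "r"(r10), "r"(rdx), "r"(rsi), "r"(rdi));
}
~~~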
2013 Aug 20
4
[LLVMdev] Memory optimizations for LLVM JIT
...IT is not as good as that generated
by clang or llc.
Here is an example:
--------------------------------------------------------------------
source fragment ==> clang or llc
struct {
uint64_t a[10];
} *p;
mov 0x8(%rax),%rdx
p->a[2] = p->a[1]; mov %rdx,0x10(%rax)
p->a[3] = p->a[1]; ==> mov %rdx,0x18(%rax)
p->a[4] = p->a[2]; mov %rdx,0x20(%rax)
p->a[5] = p->a[4]; mov %rdx,0x28(%rax)
-------------------...
2016 Jun 23
2
AVX512 instruction generated when JIT compiling for an avx2 architecture
...re.
>
> Assembly:
> .text
> .file "module"
> .globl main
> .align 16, 0x90
> .type main,@function
> main:
> .cfi_startproc
> movq 8(%rsp), %r10
> leaq (%rdi,%r8), %rdx
> addq %rsi, %r8
> testb $1, %cl
> cmoveq %rdi, %rdx
> cmoveq %rsi, %r8
> movq %rdx, %rax
> sarq $63, %rax
> shrq $62, %rax
> addq %rdx, %rax
> sarq $2, %rax
> mo...
2018 Sep 11
2
Byte-wide stores aren't coalesced if interspersed with other stores
...ss all the
> previous stores) that allow it to do its job.
>
> In the case at hand, with a manual 64bit store (this is on a 64bit
> target), llvm then combines 8 byte-wide stores into one.
>
>
> Without -combiner-global-alias-analysis it generates:
>
> movb $0, 1(%rdx)
> movl 4(%rsi,%rdi), %ebx
> movq %rbx, 8(%rcx)
> movb $0, 2(%rdx)
> movl 8(%rsi,%rdi), %ebx
> movq %rbx, 16(%rcx)
> movb $0, 3(%rdx)
> movl 12(%rsi,%rdi), %ebx
> movq %rbx, 24(%rcx)
>...
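A plausible C shape for the code behind that assembly, interleaving byte-wide stores into one buffer with wider copies into another (a guess at the pattern for illustration, not Andres's actual postgres code):
~~~
#include <stdint.h>
#include <stddef.h>

/* Each iteration clears one flag byte (the movb $0, k(%rdx) stores) and
 * widens a 32-bit value into a 64-bit slot (the movl/movq pairs).  Without
 * -combiner-global-alias-analysis the byte stores are not merged into a
 * single wider store. */
void fill(char *null_flags, const uint32_t *src, uint64_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        null_flags[i] = 0;   /* byte-wide store */
        dst[i] = src[i];     /* interspersed wider store */
    }
}
~~~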
2015 Mar 03
2
[LLVMdev] Need a clue to improve the optimization of some C code
...ss ?
Thanks for any feedback.
Ciao
Nat!
P.S. In case someone is interested, here is the assembler code and the IR that produced it.
Relevant LLVM generated x86_64 assembler portion with -Os
~~~
testq %r12, %r12
je LBB0_5
## BB#1:
movq -8(%r12), %rcx
movq (%rcx), %rax
movq -8(%rax), %rdx
andq %r15, %rdx
cmpq %r15, (%rax,%rdx)
je LBB0_2
## BB#3:
addq $8, %rcx
jmp LBB0_4
LBB0_2:
leaq 8(%rdx,%rax), %rcx
LBB0_4:
movq %r12, %rdi
movq %r15, %rsi
movq %r14, %rdx
callq *(%rcx)
movq %rax, %rbx
LBB0_5:
~~~
Better/tighter assembler code would be (saves 2 instructions, one jump les...
2015 Jul 24
2
[LLVMdev] SIMD for sdiv <2 x i64>
...sdiv <2 x i64> %sub.ptr.sub.i6.i.i.i.i, <i64 24, i64 24>
>>
>> Assembly:
>> vpsubq %xmm6, %xmm5, %xmm5
>> vmovq %xmm5, %rax
>> movabsq $3074457345618258603, %rbx # imm = 0x2AAAAAAAAAAAAAAB
>> imulq %rbx
>> movq %rdx, %rcx
>> movq %rcx, %rax
>> shrq $63, %rax
>> shrq $2, %rcx
>> addl %eax, %ecx
>> vpextrq $1, %xmm5, %rax
>> imulq %rbx
>> movq %rdx, %rax
>> shrq $63, %rax
>> shrq $2, %rdx
>...
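The movabsq constant 0x2AAAAAAAAAAAAAAB is the multiply-high "magic number" for a signed division by 24 (2^66/24, rounded up). A minimal C sketch of the scalar computation each lane goes through, assuming gcc/clang support for __int128 and arithmetic right shifts of signed values:
~~~
#include <stdint.h>

/* n / 24 without a divide: take the high 64 bits of n * M, shift right by
 * two, and add one when the product is negative to round toward zero. */
static int64_t sdiv24(int64_t n)
{
    const int64_t M = 0x2AAAAAAAAAAAAAABLL;            /* ceil(2^66 / 24) */
    int64_t hi = (int64_t)(((__int128)n * M) >> 64);   /* multiply-high   */
    return (hi >> 2) + (int64_t)((uint64_t)hi >> 63);  /* shift + sign fix */
}
~~~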
2015 Jul 24
0
[LLVMdev] SIMD for sdiv <2 x i64>
...%rbx
movq %rbx, %rdi
vmovaps %xmm2, 96(%rsp) # 16-byte Spill
vmovaps %xmm5, 64(%rsp) # 16-byte Spill
vmovdqa %xmm6, 16(%rsp) # 16-byte Spill
callq _Znam
movq %rax, 128(%rsp)
movq 16(%r12), %rsi
movq %rax, %rdi
movq %rbx, %rdx
callq memmove
vmovdqa 16(%rsp), %xmm6 # 16-byte Reload
vmovaps 64(%rsp), %xmm5 # 16-byte Reload
vmovaps 96(%rsp), %xmm2 # 16-byte Reload
vmovdqa .LCPI582_0(%rip), %xmm4
.LBB582_4: # %invoke.cont
vmovaps %xmm2, 96(%rsp)...
2016 Oct 27
1
PIC and mcmodel=large on x86 doesn't use any relocations
...re() {
// Large Memory Model code sequences from AMD64 abi
// Figure 3.22: Position-Independent Global Data Load and Store
//
// Assume that %r15 has been loaded with GOT address by
// function prologue.
// movabs $Lsrc@GOTOFF,%rax ; R_X86_64_GOTOFF64
// movabs $Ldst@GOTOFF,%rdx ; R_X86_64_GOTOFF64
// movl (%rax,%r15),%ecx
// movl %ecx,(%rdx,%r15)
dst = src;
// movabs $dptr@GOT,%rax ; R_X86_64_GOT64
// movabs $Ldst@GOTOFF,%rdx ; R_X86_64_GOTOFF64
// movq (%rax,%r15),%rax
// leaq (%rdx,%r15),%rcx
// movq %rcx,(%rax)
dptr = &dst;...
2015 Oct 27
4
How can I tell llvm, that a branch is preferred ?
...d branch, correct? I see nothing in the specs for "branch"
or "switch". And __builtin_expect does nothing, of that I am sure.
Unfortunately llvm has a knack for ordering the one most crucial part
of my code exactly the opposite of how I want it; it does: (x86_64)
cmpq %r15, (%rax,%rdx)
jne LBB0_3
Ltmp18:
leaq 8(%rax,%rdx), %rcx
jmp LBB0_4
LBB0_3:
addq $8, %rcx
LBB0_4:
when I want,
cmpq %r15, (%rax,%rdx)
je LBB0_3
addq $8, %rcx
jmp LBB0_4
LBB0_3:
leaq 8(%rax,%rdx), %rcx
LBB0_4:
since that saves me executing a jump 99.9% of the time. Is there
anything I can do?...
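For reference, the conventional way to state the preference from C is __builtin_expect, which the poster already reports having no effect here; the sketch below only shows the usual annotation pattern (the names are illustrative, not from the poster's code):
~~~
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hint that the compare almost always matches, so the matching path should
 * be laid out as the fall-through block. */
long dispatch(const long *slot, long key, long cached, long fallback)
{
    if (likely(*slot == key))
        return cached;
    return fallback;
}
~~~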
2014 Oct 13
2
[LLVMdev] Unexpected spilling of vector register during lane extraction on some x86_64 targets
..._dso_handle+0x18>
400511: vcvttps2dq %xmm1,%xmm1
400515: vpmullw 0x183(%rip),%xmm1,%xmm1 # 4006a0
<__dso_handle+0x28>
40051d: vpsubd %xmm1,%xmm0,%xmm0
400521: vmovq %xmm0,%rax
400526: movslq %eax,%rcx
400529: sar $0x20,%rax
40052d: vpextrq $0x1,%xmm0,%rdx
400533: movslq %edx,%rsi
400536: sar $0x20,%rdx
40053a: vmovss 0x4006c0(,%rcx,4),%xmm0
400543: vinsertps $0x10,0x4006c0(,%rax,4),%xmm0,%xmm0
40054e: vinsertps $0x20,0x4006c0(,%rsi,4),%xmm0,%xmm0
400559: vinsertps $0x30,0x4006c0(,%rdx,4),%xmm0,%xmm0
400564: vmulps 0x14...
2017 Apr 19
3
[cfe-dev] FE_INEXACT being set for an exact conversion from float to unsigned long long
...o I can find where the sequence is emitted.
>
>
> $ more llvm/lib/Target/X86//README-X86-64.txt
> …
> Are we better off using branches instead of cmove to implement FP to
> unsigned i64?
>
> _conv:
> ucomiss LC0(%rip), %xmm0
> cvttss2siq %xmm0, %rdx
> jb L3
> subss LC0(%rip), %xmm0
> movabsq $-9223372036854775808, %rax
> cvttss2siq %xmm0, %rdx
> xorq %rax, %rdx
> L3:
> movq %rdx, %rax
> ret
>
> instead of
>
> _conv:
> movs...
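The branchy sequence quoted from README-X86-64.txt is the standard float-to-uint64_t conversion on x86-64 without AVX-512, where only a signed cvttss2si exists. A C sketch of the idiom (a paraphrase for illustration, not the compiler's actual lowering):
~~~
#include <stdint.h>

/* Assumes 0 <= x < 2^64.  Values below 2^63 convert directly; larger ones
 * have 2^63 (exactly representable in float) subtracted first, and the top
 * bit is put back afterwards.  The thread is about sequences like this
 * setting FE_INEXACT even when the conversion is exact. */
static uint64_t float_to_u64(float x)
{
    const float two63 = 9223372036854775808.0f;   /* 2^63 */
    if (x < two63)
        return (uint64_t)(int64_t)x;
    return (uint64_t)(int64_t)(x - two63) ^ 0x8000000000000000ULL;
}
~~~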
2015 Jul 24
1
[LLVMdev] SIMD for sdiv <2 x i64>
...ovaps %xmm2, 96(%rsp) # 16-byte Spill
> vmovaps %xmm5, 64(%rsp) # 16-byte Spill
> vmovdqa %xmm6, 16(%rsp) # 16-byte Spill
> callq _Znam
> movq %rax, 128(%rsp)
> movq 16(%r12), %rsi
> movq %rax, %rdi
> movq %rbx, %rdx
> callq memmove
> vmovdqa 16(%rsp), %xmm6 # 16-byte Reload
> vmovaps 64(%rsp), %xmm5 # 16-byte Reload
> vmovaps 96(%rsp), %xmm2 # 16-byte Reload
> vmovdqa .LCPI582_0(%rip), %xmm4
> .LBB582_4: # %invoke.cont...