Displaying 20 results from an estimated 500 matches similar to: "How shall I evaluate the latency of each instruction in LLVM IR?"

2014 Dec 22
2
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
> On Behalf Of Herbie Robinson
> Subject: Re: [LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
>
> On 12/21/14 4:27 AM, Kuperstein, Michael M wrote:
> > Which performance guidelines are you referring to?
> Table C-21 in "Intel(r) 64 and IA-32 Architectures
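(For readers skimming the thread: the transformation rewrites the movs that store call arguments into a pre-reserved stack frame as pushes. A minimal sketch of the before/after shapes, with hypothetical registers and a two-argument callee — an illustration of the idea, not code from the patch itself; caller-side cleanup is omitted:)

    # Before: frame reserved up front, arguments stored with movl
            subl    $8, %esp
            movl    %eax, 4(%esp)
            movl    %ebx, (%esp)
            calll   foo

    # After: the same stack layout built with pushes,
    # removing the explicit frame adjustment and stack-slot stores
            pushl   %eax
            pushl   %ebx
            calll   foo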
2016 Jan 21
2
Adding support for self-modifying branches to LLVM?
On 01/19/2016 09:04 PM, Sean Silva via llvm-dev wrote:
> AFAIK, the cost of a well-predicted, not-taken branch is the same as a
> nop on every x86 made in the last many years.
> See http://www.agner.org/optimize/instruction_tables.pdf
> Generally speaking a correctly-predicted not-taken branch is basically
2019 Aug 20
1
Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S
"H. Peter Anvin" <hpa at zytor.com> wrote August 20, 2019 12:51 AM: > On 8/14/19 9:42 PM, Stefan Kanthak wrote: >> Hi, >> >> both >> https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__ashldi3.S >> and >> https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__lshrdi3.S
2016 Jan 19
4
Adding support for self-modifying branches to LLVM?
Hi, I’m thinking about using LLVM to implement a limited form of self-modifying code. Before diving into that, I’d like to get some feedback from you all. *The goal:* I’d like to add “optional” code to a program that I can enable at runtime and that has zero (i.e., as close to zero as I can get) overhead when not enabled. *Existing solutions:* Currently, I can guard optional code using a
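(The two alternatives weighed in this thread, sketched in x86 assembly — an illustration of the idea, not the poster's actual design: an always-present guard branch, versus a patchable site that is a nop while the feature is disabled and is rewritten to a jump when enabled, in the style of Linux's static keys. Labels and the `enabled` flag are hypothetical:)

    # Guard branch: always executed, almost always not taken
            cmpb    $0, enabled
            jne     optional_code
    resume:

    # Self-modifying alternative: a 5-byte nop that the runtime can
    # overwrite with "jmp optional_code" (e9 rel32) when enabling
    patch_site:
            .byte   0x0f, 0x1f, 0x44, 0x00, 0x00   # nopl 0x0(%rax,%rax,1)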
2019 Aug 15
2
Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S
Hi,

both
https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__ashldi3.S
and
https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__lshrdi3.S
use the following code sequences for shift counts greater than 31:

1:      xorl    %edx,%edx            1:      shrl    %cl,%edx
        shl     %cl,%eax                     xorl    %eax,%eax
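(The point the thread is driving at: the xchg that follows the truncated lines above is microcoded and slow on many cores, and since one register is about to be zeroed anyway, it can be replaced by a plain mov. A sketch for the count >= 32 path of __ashldi3, with the 64-bit value in %edx:%eax and the count in %cl — not necessarily the exact patch that was applied:)

    # count >= 32: result high = low << (count & 31), result low = 0
    1:      shll    %cl, %eax
            movl    %eax, %edx      # mov instead of the slow xchg
            xorl    %eax, %eax
            ret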
2012 Jul 27
0
[LLVMdev] X86 FMA4
On Fri, Jul 27, 2012 at 2:37 PM, Michael Gottesman <mgottesman at apple.com> wrote:
...
> I have actually timed said instructions in the past and reproduced Agner
> Fog's results. I just prefer to speak by referring to facts that can not be
> misconstrued as hearsay = ).

That would be great. Also, can you point me to the Agner Fog table that you are referring to? Thanks.
2018 Mar 15
5
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
[You can find an easier-to-read and more complete version of this RFC here:
<https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>]

Knowing instruction scheduling properties (latency, uops) is the basis for all scheduling work done by LLVM. Unfortunately, vendors usually release only partial (and sometimes incorrect) information. Updating the
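(The measurement idea behind the RFC, roughly: generate a snippet that repeats the instruction under test and time it with performance counters. A serial register chain exposes latency; independent destinations expose uops/throughput. Illustrative snippet only, not llvm-exegesis output:)

    # Latency mode: every addl depends on the previous one through %eax,
    # so cycles/instruction converges to the instruction's latency
            addl    %ecx, %eax
            addl    %ecx, %eax
            addl    %ecx, %eax

    # Uops/throughput mode: independent destinations can issue in
    # parallel, so port pressure dominates instead
            addl    %ecx, %eax
            addl    %ecx, %ebx
            addl    %ecx, %esi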
2014 Dec 21
2
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
Which performance guidelines are you referring to? I'm not that familiar with decade-old CPUs, but to the best of my knowledge, this is not true on current hardware. There is one specific circumstance where PUSHes should be avoided - for Atom/Silvermont processors, the memory form of PUSH is inefficient, so the register-freeing optimization below may not be profitable (see 14.3.3.6 and
2012 Jul 27
3
[LLVMdev] X86 FMA4
> It appears that the stats you listed are for movaps [SSE], not vmovaps
> [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256),
> since they are both AVX instructions. Although, yes, I agree that this is
> not clear from Agner's report. Please correct me if I am misunderstanding.

You are misunderstanding [no worries, happens to everyone = )]. The timings I listed were for
2018 Aug 14
4
Why did Intel change its static branch prediction mechanism during these years?
(I don't know if it's allowed to ask such a question; if not, please remind me.) I know Intel has implemented several static branch prediction mechanisms over the years:
* 80486 age: Always not taken
* Pentium 4 age: Backwards Taken / Forwards Not-Taken
* PM, Core2: no static prediction; the outcome depends on whatever happens to be in the corresponding BTB entry, according to Agner's
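(As a concrete illustration of the Backwards Taken / Forwards Not-Taken rule: a cold branch that jumps backwards, as loop latches do, is assumed taken, while a forward branch, typically guarding a rare case, is assumed not taken. Illustrative snippet with hypothetical labels:)

    loop:
            decl    %ecx
            jnz     loop            # backward branch: statically predicted taken
            testl   %eax, %eax
            jz      rare_case       # forward branch: statically predicted not taken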
2018 Mar 15
3
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote:
>
> [You can find an easier to read and more complete version of this RFC here
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]
>
> Knowing
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote:
> [You can find an easier to read and more complete version of this RFC here
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]
>
> Knowing instruction scheduling properties (latency, uops) is the basis
> for all scheduling work done by LLVM.
2012 Sep 29
7
[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler
Hi, We are currently working on revising a journal article that describes our work on pre-allocation scheduling using LLVM, and we have some questions about LLVM's pre-allocation scheduler. The answers to these questions will help us better document and analyze the results of our benchmark tests that compare our algorithm with LLVM's pre-allocation scheduling algorithm. First, here is a
2015 Jan 22
2
[LLVMdev] X86TargetLowering::LowerToBT
> On Jan 22, 2015, at 1:22 PM, Fiona Glaser <fglaser at apple.com> wrote:
>
> According to Agner’s docs, many CPUs have slower BT than TEST; Haswell has
> only 0.5 inverse throughput as opposed to 0.25, Atom has 1 instead of 0.5,
> and Silvermont can’t even dual-issue BT (it locks both ALUs). So while BT
> does seem to have a shorter instruction encoding than TEST for TEST reg, imm32 where
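(The encoding trade-off in question: testing a single bit above bit 7 with TEST requires an imm32, while BT encodes the bit index in an imm8. Illustrative, using bit 20 and the short %eax form of TEST:)

    testl   $0x100000, %eax         # a9 00 00 10 00 - 5 bytes (6 with other registers)
    jne     bit_set

    btl     $20, %eax               # 0f ba e0 14    - 4 bytes
    jc      bit_set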
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
Sounds like a very useful tool.  Thank you for contributing. Taking a step back and looking at the big picture, combining this with the recently contributed llvm-mca dramatically improves our scheduling and performance analysis story.  Being able to take a snippet of code on a particular machine, measure latency/throughput/ports for each instruction (this tool), and then analyze the entire
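(The combined workflow this reply describes: feed a snippet of assembly to llvm-mca for static per-instruction/per-port analysis against a chosen CPU model, with the scheduling model itself validated by llvm-exegesis measurements. A toy input one might analyze — e.g. with something like "llvm-mca -mcpu=haswell snippet.s" — is:)

    # A dependent multiply/add chain whose bottleneck the analyzer can
    # attribute to latency rather than port pressure
            vmulps  %xmm0, %xmm1, %xmm2
            vaddps  %xmm2, %xmm3, %xmm4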
2011 Apr 08
3
[LLVMdev] Macro-op fusion experiment
On Apr 8, 2011, at 9:56 AM, NAKAMURA Takumi wrote:
>>> 8B C3          mov eax, ebx
>>> 03 C1          add eax, ecx
>>> becomes
>>> 8B C3 03 C1    add eax, ebx, ecx
>
> In my understanding, twoaddr pass tends to emit such a sequence.

Yes, it always does, and the coalescer tries very hard to eliminate the copy.

> Though I
2010 Mar 15
2
cwrsync and link-dest option
Hello,

In a small environment I have to back up two servers, an Ubuntu 9.10 and a Windows Server 2008 machine. My backup host is an Ubuntu 9.10 machine as well. I installed rsync on both Ubuntu hosts from the repository (3.0.6) and cwrsync from http://www.itefix.no/i2/node/10650 (3.0.7). Then I wrote some Bash scripts which execute rsync every week like this:

BACKUPDIR=/var/backup
rsync -v -a
2012 Jul 27
2
[LLVMdev] X86 FMA4
Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps/etc. for loading/storing from memory:

vmovaps (load):  1 load µop, latency 3, reciprocal throughput 0.5.
vmovaps (store): 1 store µop plus 1 load µop for address calculation, latency 3, reciprocal throughput 1.

He does not list vmovsd, but movsd has the same stats as vmovaps, so I feel it is a
2012 Sep 29
0
[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler
On Sep 29, 2012, at 2:43 AM, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:
> Hi,
>
> We are currently working on revising a journal article that describes our
> work on pre-allocation scheduling using LLVM and have some questions about
> LLVM's pre-allocation scheduler. The answers to these questions will help
> us better document and analyze the results of our benchmark
2012 Nov 07
1
[LLVMdev] AVX broadcast Vs. vector constant pool load
Hey guys,

I'm currently investigating broadcasts from the constant pool on Sandy Bridge. I see this comment in llvm/lib/Target/X86/X86ISelLowering.cpp:

  // Handle the broadcasting a single constant scalar from the constant pool
  // into a vector. On Sandybridge it is still better to load a constant vector
  // from the constant pool and not to broadcast it from a scalar.

Would anyone
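(Concretely, the two lowerings being weighed — illustrative AT&T syntax with hypothetical constant-pool labels: a 4-byte scalar plus a broadcast, versus a full 32-byte vector load, which the comment above claims is still the better choice on Sandy Bridge:)

    # Option A: broadcast a scalar constant from the constant pool
            vbroadcastss    .LCPI0_0, %ymm0

    # Option B: load the whole vector constant
    # (larger constant pool, but a plain load)
            vmovaps         .LCPI0_1, %ymm0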