Displaying 20 results from an estimated 500 matches similar to: "How shall I evaluate the latency of each instruction in LLVM IR?"

2014 Dec 22
2
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
> On Behalf Of Herbie Robinson
> Subject: Re: [LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
>
> On 12/21/14 4:27 AM, Kuperstein, Michael M wrote:
> > Which performance guidelines are you referring to?
> Table C-21 in "Intel(r) 64 and IA-32 Architectures
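(For readers skimming the thread: the transformation rewrites the movs that store call arguments into a pre-reserved stack frame as pushes. A minimal sketch of the before/after shapes, with hypothetical registers and a two-argument callee — an illustration of the idea, not code from the patch itself; caller-side cleanup is omitted:)

    # Before: frame reserved up front, arguments stored with movl
            subl    $8, %esp
            movl    %eax, 4(%esp)
            movl    %ebx, (%esp)
            calll   foo

    # After: the same stack layout built with pushes,
    # removing the explicit frame adjustment and stack-slot stores
            pushl   %eax
            pushl   %ebx
            calll   foo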
2016 Jan 21
2
Adding support for self-modifying branches to LLVM?
On 01/19/2016 09:04 PM, Sean Silva via llvm-dev wrote:
> AFAIK, the cost of a well-predicted, not-taken branch is the same as a
> nop on every x86 made in the last many years.
> See http://www.agner.org/optimize/instruction_tables.pdf
> Generally speaking a correctly-predicted not-taken branch is basically
2019 Aug 20
1
Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S
"H. Peter Anvin" <hpa at zytor.com> wrote August 20, 2019 12:51 AM: > On 8/14/19 9:42 PM, Stefan Kanthak wrote: >> Hi, >> >> both >> https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__ashldi3.S >> and >> https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__lshrdi3.S
2016 Jan 19
4
Adding support for self-modifying branches to LLVM?
Hi, I’m thinking about using LLVM to implement a limited form of self-modifying code. Before diving into that, I’d like to get some feedback from you all. *The goal:* I’d like to add “optional” code to a program that I can enable at runtime and that has zero (i.e., as close to zero as I can get) overhead when not enabled. *Existing solutions:* Currently, I can guard optional code using a
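(The two alternatives weighed in this thread, sketched in x86 assembly — an illustration of the idea, not the poster's actual design: an always-present guard branch, versus a patchable site that is a nop while the feature is disabled and is rewritten to a jump when enabled, in the style of Linux's static keys. Labels and the `enabled` flag are hypothetical:)

    # Guard branch: always executed, almost always not taken
            cmpb    $0, enabled
            jne     optional_code
    resume:

    # Self-modifying alternative: a 5-byte nop that the runtime can
    # overwrite with "jmp optional_code" (e9 rel32) when enabling
    patch_site:
            .byte   0x0f, 0x1f, 0x44, 0x00, 0x00   # nopl 0x0(%rax,%rax,1)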
2019 Aug 15
2
Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S
Hi,

both
https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__ashldi3.S
and
https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__lshrdi3.S
use the following code sequences for shift counts greater than 31:

1:      xorl    %edx,%edx            1:      shrl    %cl,%edx
        shl     %cl,%eax                     xorl    %eax,%eax
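(The point the thread is driving at: the xchg that follows the truncated lines above is microcoded and slow on many cores, and since one register is about to be zeroed anyway, it can be replaced by a plain mov. A sketch for the count >= 32 path of __ashldi3, with the 64-bit value in %edx:%eax and the count in %cl — not necessarily the exact patch that was applied:)

    # count >= 32: result high = low << (count & 31), result low = 0
    1:      shll    %cl, %eax
            movl    %eax, %edx      # mov instead of the slow xchg
            xorl    %eax, %eax
            ret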
2012 Jul 27
0
[LLVMdev] X86 FMA4
On Fri, Jul 27, 2012 at 2:37 PM, Michael Gottesman <mgottesman at apple.com> wrote:
...
> I have actually timed said instructions in the past and reproduced Agner
> Fog's results. I just prefer to speak by referring to facts that can not be
> misconstrued as hearsay = ).

That would be great. Also, can you point me to the Agner Fog table that you are referring to? Thanks.
2018 Mar 15
5
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
[You can find an easier-to-read and more complete version of this RFC here:
<https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>]

Knowing instruction scheduling properties (latency, uops) is the basis for all scheduling work done by LLVM. Unfortunately, vendors usually release only partial (and sometimes incorrect) information. Updating the
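(The measurement idea behind the RFC, roughly: generate a snippet that repeats the instruction under test and time it with performance counters. A serial register chain exposes latency; independent destinations expose uops/throughput. Illustrative snippet only, not llvm-exegesis output:)

    # Latency mode: every addl depends on the previous one through %eax,
    # so cycles/instruction converges to the instruction's latency
            addl    %ecx, %eax
            addl    %ecx, %eax
            addl    %ecx, %eax

    # Uops/throughput mode: independent destinations can issue in
    # parallel, so port pressure dominates instead
            addl    %ecx, %eax
            addl    %ecx, %ebx
            addl    %ecx, %esi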
2014 Dec 21
2
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
Which performance guidelines are you referring to? I'm not that familiar with decade-old CPUs, but to the best of my knowledge, this is not true on current hardware. There is one specific circumstance where PUSHes should be avoided - for Atom/Silvermont processors, the memory form of PUSH is inefficient, so the register-freeing optimization below may not be profitable (see 14.3.3.6 and
2012 Jul 27
3
[LLVMdev] X86 FMA4
> It appears that the stats you listed are for movaps [SSE], not vmovaps
> [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256),
> since they are both AVX instructions. Although, yes, I agree that this is
> not clear from Agner's report. Please correct me if I am misunderstanding.

You are misunderstanding [no worries, happens to everyone = )]. The timings I listed were for
2018 Aug 14
4
Why did Intel change its static branch prediction mechanism during these years?
(I don't know if it's allowed to ask such a question; if not, please remind me.) I know Intel has implemented several static branch prediction mechanisms over the years:
* 80486 age: Always not taken
* Pentium 4 age: Backwards Taken / Forwards Not-Taken
* PM, Core2: no static prediction; the outcome depends on whatever happens to be in the corresponding BTB entry, according to Agner's
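(As a concrete illustration of the Backwards Taken / Forwards Not-Taken rule: a cold branch that jumps backwards, as loop latches do, is assumed taken, while a forward branch, typically guarding a rare case, is assumed not taken. Illustrative snippet with hypothetical labels:)

    loop:
            decl    %ecx
            jnz     loop            # backward branch: statically predicted taken
            testl   %eax, %eax
            jz      rare_case       # forward branch: statically predicted not taken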
2018 Mar 15
3
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote:
>
> [You can find an easier to read and more complete version of this RFC here
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]
>
> Knowing
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote:
> [You can find an easier to read and more complete version of this RFC here
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]
>
> Knowing instruction scheduling properties (latency, uops) is the basis
> for all scheduling work done by LLVM.
2012 Sep 29
7
[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler
Hi, We are currently working on revising a journal article that describes our work on pre-allocation scheduling using LLVM, and we have some questions about LLVM's pre-allocation scheduler. The answers to these questions will help us better document and analyze the results of our benchmark tests that compare our algorithm with LLVM's pre-allocation scheduling algorithm. First, here is a
2015 Jan 22
2
[LLVMdev] X86TargetLowering::LowerToBT
> On Jan 22, 2015, at 1:22 PM, Fiona Glaser <fglaser at apple.com> wrote:
>
> According to Agner’s docs, many CPUs have slower BT than TEST; Haswell has
> only 0.5 inverse throughput as opposed to 0.25, Atom has 1 instead of 0.5,
> and Silvermont can’t even dual-issue BT (it locks both ALUs). So while BT
> does seem to have a shorter instruction encoding than TEST for TEST reg, imm32 where
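(The encoding trade-off in question: testing a single bit above bit 7 with TEST requires an imm32, while BT encodes the bit index in an imm8. Illustrative, using bit 20 and the short %eax form of TEST:)

    testl   $0x100000, %eax         # a9 00 00 10 00 - 5 bytes (6 with other registers)
    jne     bit_set

    btl     $20, %eax               # 0f ba e0 14    - 4 bytes
    jc      bit_set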
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
Sounds like a very useful tool.  Thank you for contributing. Taking a step back and looking at the big picture, combining this with the recently contributed llvm-mca dramatically improves our scheduling and performance analysis story.  Being able to take a snippet of code on a particular machine, measure latency/throughput/ports for each instruction (this tool), and then analyze the entire
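(The combined workflow this reply describes: feed a snippet of assembly to llvm-mca for static per-instruction/per-port analysis against a chosen CPU model, with the scheduling model itself validated by llvm-exegesis measurements. A toy input one might analyze — e.g. with something like "llvm-mca -mcpu=haswell snippet.s" — is:)

    # A dependent multiply/add chain whose bottleneck the analyzer can
    # attribute to latency rather than port pressure
            vmulps  %xmm0, %xmm1, %xmm2
            vaddps  %xmm2, %xmm3, %xmm4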
2011 Apr 08
3
[LLVMdev] Macro-op fusion experiment
On Apr 8, 2011, at 9:56 AM, NAKAMURA Takumi wrote:
>>> 8B C3          mov eax, ebx
>>> 03 C1          add eax, ecx
>>> becomes
>>> 8B C3 03 C1    add eax, ebx, ecx
>
> In my understanding, twoaddr pass tends to emit such a sequence.

Yes, it always does, and the coalescer tries very hard to eliminate the copy.

> Though I
2010 Mar 15
2
cwrsync and link-dest option
Hello,

In a small environment I have to back up two servers, an Ubuntu 9.10 and a Windows Server 2008 machine. My backup host is an Ubuntu 9.10 machine as well. I installed rsync on both Ubuntu hosts from the repository (3.0.6) and cwrsync from http://www.itefix.no/i2/node/10650 (3.0.7). Then I wrote some Bash scripts which execute rsync every week like this:

BACKUPDIR=/var/backup
rsync -v -a
2012 Jul 27
2
[LLVMdev] X86 FMA4
Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps/etc. for loading/storing from memory:

vmovaps (load):  1 load µop, latency 3, reciprocal throughput 0.5.
vmovaps (store): 1 store µop plus 1 load µop for address calculation, latency 3, reciprocal throughput 1.

He does not list vmovsd, but movsd has the same stats as vmovaps, so I feel it is a
2012 Sep 29
0
[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler
On Sep 29, 2012, at 2:43 AM, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:
> Hi,
>
> We are currently working on revising a journal article that describes our
> work on pre-allocation scheduling using LLVM and have some questions about
> LLVM's pre-allocation scheduler. The answers to these questions will help
> us better document and analyze the results of our benchmark
2012 Nov 07
1
[LLVMdev] AVX broadcast Vs. vector constant pool load
Hey guys,

I'm currently investigating broadcasts from the constant pool on Sandy Bridge. I see this comment in llvm/lib/Target/X86/X86ISelLowering.cpp:

  // Handle the broadcasting a single constant scalar from the constant pool
  // into a vector. On Sandybridge it is still better to load a constant vector
  // from the constant pool and not to broadcast it from a scalar.

Would anyone
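(Concretely, the two lowerings being weighed — illustrative AT&T syntax with hypothetical constant-pool labels: a 4-byte scalar plus a broadcast, versus a full 32-byte vector load, which the comment above claims is still the better choice on Sandy Bridge:)

    # Option A: broadcast a scalar constant from the constant pool
            vbroadcastss    .LCPI0_0, %ymm0

    # Option B: load the whole vector constant
    # (larger constant pool, but a plain load)
            vmovaps         .LCPI0_1, %ymm0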