search for: agner

Displaying 20 results from an estimated 71 matches for "agner".

2014 Dec 22
2
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
...> > Which performance guidelines are you referring to? > Table C-21 in "Intel(r) 64 and IA-32 Architectures Optimization Reference Manual", September 2014. > It hasn't changed. It still lists push and pop instructions as 2-3 times more expensive than mov. And this is verified by Agner Fog's independent measurements: http://www.agner.org/optimize/instruction_tables.pdf The relevant Haswell numbers are on pages 186-187. -Chuck
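To make the comparison concrete, here is a minimal C sketch of the kind of 32-bit call sequence under discussion (the callee f and the build flags are illustrative, not taken from the thread):

    /* Compile with: gcc -m32 -O2 -S caller.c and inspect the output.
     * The question in the thread is whether the argument area is built as
     *   movl $2, 4(%esp); movl $1, (%esp); call f     (mov-based)
     * or
     *   pushl $2; pushl $1; call f                    (push-based)
     * given the push/pop costs cited from the Intel manual and Agner Fog. */
    extern int f(int a, int b);

    int caller(void) {
        return f(1, 2);
    }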
2018 Aug 14
4
Why did Intel change its static branch prediction mechanism over the years?
...I know Intel has implemented several static branch prediction mechanisms over the years: * 80486 era: always not-taken * Pentium 4 era: backwards taken / forwards not-taken * PM, Core2: no static prediction; the outcome depends on whatever happens to be in the corresponding BTB entry, according to Agner's optimization guide ¹. * Newer CPUs like Ivy Bridge and Haswell have become increasingly opaque, according to Matt G's experiment ². And Intel doesn't seem to want to talk about it any more, because the latest material I found in Intel's documentation was written about ten years ago. I k...
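As a side note not from the thread, the old backwards-taken/forwards-not-taken heuristic is why compilers try to lay out cold paths on forward branches; a small illustrative C example:

    #include <stdio.h>

    /* Illustrative only: __builtin_expect marks the error path as unlikely,
     * so the compiler tends to place it on a forward branch, which the
     * BT/FNT static predictor described above would guess not-taken. */
    int process(int x) {
        if (__builtin_expect(x < 0, 0)) {
            fprintf(stderr, "bad input\n");
            return -1;
        }
        return x * 2;   /* hot fall-through path */
    }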
2012 Jul 27
0
[LLVMdev] X86 FMA4
On Fri, Jul 27, 2012 at 2:37 PM, Michael Gottesman <mgottesman at apple.com> wrote: ... > I have actually timed said instructions in the past and reproduced Agner > Fog's results. I just prefer to speak by referring to facts that can not be > misconstrued as hearsay = ). That would be great. Also, can you point me to the Agner Fog table that you are referring to? Thanks.
2012 Jul 27
3
[LLVMdev] X86 FMA4
> It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. You are misunderstanding [no worries, happens to everyone = )]. The timings I listed were for vmovaps of the form vmovaps %xmm0, (mem), i.e., its 128-bit AVX form. Let me explain. There are 3 categories of instructions we...
2019 May 13
3
How shall I evaluate the latency of each instruction in LLVM IR?
Inspired by https://www.agner.org/optimize/instruction_tables.pdf, which gives the latency and reciprocal throughput of each instruction on the different x86 microarchitectures, has anybody taken on the effort of doing a similar job for LLVM IR? Thanks!
2018 Mar 15
5
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
...p. uop decomposition) of the instruction. The code snippet is jitted and executed on the host subtarget. The time taken (resp. resource usage) is measured using hardware performance counters. More details can be found in the ‘implementation’ section of the RFC. For people familiar with the work of Agner Fog, this is essentially an automation of the process of building the code snippets using instruction descriptions from LLVM. Results - Solving this bug <https://bugs.llvm.org/show_bug.cgi?id=36084> (sandybridge): > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency...
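For readers who have not done Agner-Fog-style measurements by hand, here is a rough sketch of the manual process the tool automates, timing a serially dependent chain of one instruction; this is illustrative only, since llvm-exegesis JITs the snippet and reads hardware performance counters rather than rdtsc:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc */

    int main(void) {
        const long iters = 100000000;
        uint64_t x = 1, start = __rdtsc();
        for (long i = 0; i < iters; ++i)
            /* each imul depends on the previous result, so the loop runs at
             * roughly the instruction's latency per iteration */
            __asm__ volatile("imul $3, %0, %0" : "+r"(x));
        uint64_t end = __rdtsc();
        printf("~%.2f reference cycles per imul\n",
               (double)(end - start) / iters);
        return (int)(x & 1);   /* keep the chain's result live */
    }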
2019 Aug 20
1
Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S
...er-register. "but should be fine" is not enough: XCHG is of course slow for register-register operations too, otherwise I would not have spent time writing in. See https://stackoverflow.com/questions/45766444/why-is-xchg-reg-reg-a-3-micro-op-instruction-on-modern-intel-architectures or Agner Fog's http://www.agner.org/optimize/instruction_tables.pdf > Remember, too, that klibc is optimized for size. Remember that the linker aligns functions on 16-byte boundaries! With XCHG, these functions have a code size of 29 bytes; with MOV they grow by 1 byte. >> PS: I doubt that a cur...
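For illustration (the klibc files in question are hand-written assembly, so this C sketch is only to make the register-register point concrete):

    /* xchg reg,reg is one instruction but 3 micro-ops on recent Intel cores,
     * per the Stack Overflow link above, while a mov-based swap is plain
     * single-uop register moves. */
    static inline void swap_xchg(unsigned *a, unsigned *b) {
        __asm__("xchg %0, %1" : "+r"(*a), "+r"(*b));
    }

    static inline void swap_mov(unsigned *a, unsigned *b) {
        unsigned t = *a;
        *a = *b;
        *b = t;
    }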
2016 Jan 21
2
Adding support for self-modifying branches to LLVM?
On 01/19/2016 09:04 PM, Sean Silva via llvm-dev wrote: > > AFAIK, the cost of a well-predicted, not-taken branch is the same as a > nop on every x86 made in the last many years. > See http://www.agner.org/optimize/instruction_tables.pdf > Generally speaking a correctly-predicted not-taken branch is basically > identical to a nop, and a correctly-predicted taken branch has an > extra overhead similar to an "add"...
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
...instruction. The code snippet is jitted and executed on the host > subtarget. The time taken (resp. resource usage) is measured using > hardware performance counters. More details can be found in the > ‘implementation’ section of the RFC. > > > For people familiar with the work of Agner Fog, this is essentially an > automation of the process of building the code snippets using > instruction descriptions from LLVM. > > > Results > > * > > Solving this bug > <https://bugs.llvm.org/show_bug.cgi?id=36084>(sandybridge): > > > llvm...
2018 Mar 15
3
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
...the instruction. The code snippet > is jitted and executed on the host subtarget. The time taken (resp. > resource usage) is measured using hardware performance counters. More > details can be found in the ‘implementation’ section of the RFC. > > For people familiar with the work of Agner Fog, this is essentially an > automation of the process of building the code snippets using instruction > descriptions from LLVM. > Results > > - > > Solving this bug <https://bugs.llvm.org/show_bug.cgi?id=36084> > (sandybridge): > > > llvm-exegesis -...
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
...truction. The code snippet is jitted and executed on the host > subtarget. The time taken (resp. resource usage) is measured using > hardware performance counters. More details can be found in the > ‘implementation’ section of the RFC. > > > For people familiar with the work of Agner Fog, this is essentially an > automation of the process of building the code snippets using > instruction descriptions from LLVM. > > > Results > > * > > Solving this bug > <https://bugs.llvm.org/show_bug.cgi?id=36084>(sandybridge): > > > ll...
2012 Sep 29
7
[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler
...of some benchmarks relative to LLVM's default scheduler by up to 21%. The geometric-mean speedup on FP2006 is about 2.4% across the entire suite. We then observed that LLVM's ILP scheduler on x86-64 uses "rough" latency values. So, we added the precise latency values published by Agner (http://www.agner.org/optimize/) and that led to more speedup relative to LLVM's ILP scheduler on some benchmarks. The most significant gain from adding precise latencies was on the gromacs benchmark, which has a high degree of ILP. I am attaching the benchmarking results on x86-64 using both...
2015 Jan 22
2
[LLVMdev] X86TargetLowering::LowerToBT
> On Jan 22, 2015, at 1:22 PM, Fiona Glaser <fglaser at apple.com> wrote: > > According to Agner’s docs, many CPUs have slower BT than TEST; Haswell has only 0.5 inverse throughput as opposed to 0.25, Atom has 1 instead of 0.5, and Silvermont can’t even dual-issue BT (it locks both ALUs). So while BT does seem to have a shorter instruction encoding than TEST for TEST reg, imm32 where imm32 has on...
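A minimal C sketch of the single-bit test this lowering is about (illustrative; whether the backend emits bt or a shift/test sequence is exactly the cost question raised in the thread):

    #include <stdbool.h>
    #include <stdint.h>

    /* (word >> n) & 1 with a variable bit index is the kind of pattern
     * X86TargetLowering::LowerToBT considers turning into BT reg,reg. */
    bool bit_is_set(uint32_t word, unsigned n) {
        return (word >> n) & 1u;
    }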
2012 Jul 27
2
[LLVMdev] X86 FMA4
Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps/etc. for loading/storing from memory. vmovaps load: 1 load µop, latency 3, reciprocal throughput 0.5. vmovaps store: 1 store µop plus 1 load µop for address calculation, latency 3, reciprocal throughput 1. He does not list vmovsd...
2012 Jul 27
0
[LLVMdev] X86 FMA4
Hey Michael, Thanks for the legwork! It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. As I am sure you are aware, we cannot use SSE (movaps) instructions in an AVX context, or else we'll pay the context switch penalty. It might be too big an assumption to assume that movaps and vmovaps have the same timings. Same for mov...
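A short intrinsics sketch of the mixing concern raised here (illustrative; compilers normally insert vzeroupper themselves when leaving AVX code, and the snippet assumes a build with -mavx):

    #include <immintrin.h>

    void scale(float *dst, const float *src, float k) {
        __m256 v = _mm256_loadu_ps(src);                             /* 256-bit AVX */
        _mm256_storeu_ps(dst, _mm256_mul_ps(v, _mm256_set1_ps(k)));
        _mm256_zeroupper();   /* so later legacy-SSE code avoids the
                                 SSE/AVX transition penalty mentioned above */
    }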
2012 Sep 29
0
[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler
...of some benchmarks relative to LLVM's default scheduler by up to 21%. The geometric-mean speedup on FP2006 is about 2.4% across the entire suite. We then observed that LLVM's ILP scheduler on x86-64 uses "rough" latency values. So, we added the precise latency values published by Agner (http://www.agner.org/optimize/) and that led to more speedup relative to LLVM's ILP scheduler on some benchmarks. The most significant gain from adding precise latencies was on the gromacs benchmark, which has a high degree of ILP. I am attaching the benchmarking results on x86-64 using both L...
2011 Apr 17
0
[LLVMdev] Macro-op fusion experiment
Hi Jacob, As far as I know, an x86 'mov' instruction always uses an ALU resource. According to Agner Fog's documents (http://www.agner.org/optimize/), it can execute on port 0, 1 or 5 on recent architectures though. So it's not that likely to be resource limited. But it still occupies an instruction slot throughout the entire pipeline, costing power and potentially limiting other actual ar...
2012 Nov 07
1
[LLVMdev] AVX broadcast Vs. vector constant pool load
...constant pool // into a vector. On Sandybridge it is still better to load a constant vector // from the constant pool and not to broadcast it from a scalar. Would anyone be able to explain why it is better to load a vector from the constant pool rather than broadcast a scalar? I checked out Agner Fog's tables, but it wasn't so obvious to me... vmovaps y, m256: Uops: 1 Lat: 4 Throughput: 1 vbroadcastsd y, m64: Uops: 2 Lat: [Not or cannot be measured] Throughput: 1 Thanks in advance, Cameron
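The two code-generation choices being weighed can be sketched with intrinsics (illustrative only; the actual decision is made in the X86 backend):

    #include <immintrin.h>

    __m256d splat_broadcast(const double *p) {
        return _mm256_broadcast_sd(p);   /* vbroadcastsd ymm, m64 */
    }

    __m256d splat_constant(void) {
        /* typically materialized as a full 256-bit constant-pool load
         * (vmovaps ymm, m256), the option the quoted comment prefers
         * on Sandy Bridge */
        return _mm256_set1_pd(2.0);
    }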
2014 Jan 03
1
PATCH: match calls and returns
According to Agner Fog, "...you must make sure that all calls are matched with returns. Never jump out of a subroutine without a return and never use a return as an indirect jump." (see paragraph 3.15 in microarchitecture.pdf and examples 3.5a and 3.5b in optimizing_assembly.pdf) Basically this patch repl...
2014 Jan 14
1
PATCH for lpc_asm.nasm
1) Two comments ";ASSERT(lp_quantization <= 31)" in the new functions ..._wide_asm_ia32() -- just to mention this constraint. (max. possible value of lp_quantization is 15, so it's not a problem) 2) "mov cl, ..." was replaced with "mov ecx, ..." (again Agner Fog, optimizing_assembly.pdf) Summary: a write to a partial register may result in false dependencies between instructions, so it is better to avoid it. (also bitreader_asm.nasm and stream_encoder_asm.nasm both have "mov ecx, ..." instructions, and no "mov cl, ...").
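A small illustration of the partial-register point (inline asm purely for demonstration; the actual change was in the hand-written NASM sources):

    /* Writing the full ecx before using cl as a shift count avoids the
     * false dependence on the old ecx contents that a "mov cl, ..." write
     * can create on some microarchitectures. */
    static inline unsigned shr_by(unsigned x, unsigned n) {
        __asm__("mov %1, %%ecx\n\t"   /* full 32-bit write, preferred      */
                /* "mov %b1, %%cl"       partial write: possible false dep */
                "shr %%cl, %0"
                : "+r"(x)
                : "r"(n & 31u)
                : "ecx", "cc");
        return x;
    }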