search for: agner

Displaying 20 results from an estimated 71 matches for "agner".

2014 Dec 22
2
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
...> > Which performance guidelines are you referring to? > Table C-21 in "Intel(r) 64 and IA-32 Architectures Optimization Reference Manual", September 2014. > It hasn't changed. It still lists push and pop instructions as 2-3 times more expensive than mov. And this is verified by Agner Fog's independent measurements: http://www.agner.org/optimize/instruction_tables.pdf The relevant Haswell numbers are on pages 186-187. -Chuck
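To make the comparison concrete, here is a minimal C sketch of the kind of 32-bit call sequence under discussion (the callee f and the build flags are illustrative, not taken from the thread):

    /* Compile with: gcc -m32 -O2 -S caller.c and inspect the output.
     * The question in the thread is whether the argument area is built as
     *   movl $2, 4(%esp); movl $1, (%esp); call f     (mov-based)
     * or
     *   pushl $2; pushl $1; call f                    (push-based)
     * given the push/pop costs cited from the Intel manual and Agner Fog. */
    extern int f(int a, int b);

    int caller(void) {
        return f(1, 2);
    }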
2018 Aug 14
4
Why did Intel change its static branch prediction mechanism over the years?
...I know Intel has implemented several static branch prediction mechanisms over the years: * 80486 era: always not-taken * Pentium 4 era: backwards taken / forwards not-taken * PM, Core2: no static prediction; the outcome depends on whatever happens to be in the corresponding BTB entry, according to Agner's optimization guide ¹. * Newer CPUs like Ivy Bridge and Haswell have become increasingly opaque, according to Matt G's experiment ². And Intel doesn't seem to want to talk about it any more, because the latest material I found in Intel's documentation was written about ten years ago. I k...
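As a side note not from the thread, the old backwards-taken/forwards-not-taken heuristic is why compilers try to lay out cold paths on forward branches; a small illustrative C example:

    #include <stdio.h>

    /* Illustrative only: __builtin_expect marks the error path as unlikely,
     * so the compiler tends to place it on a forward branch, which the
     * BT/FNT static predictor described above would guess not-taken. */
    int process(int x) {
        if (__builtin_expect(x < 0, 0)) {
            fprintf(stderr, "bad input\n");
            return -1;
        }
        return x * 2;   /* hot fall-through path */
    }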
2012 Jul 27
0
[LLVMdev] X86 FMA4
On Fri, Jul 27, 2012 at 2:37 PM, Michael Gottesman <mgottesman at apple.com> wrote: ... > I have actually timed said instructions in the past and reproduced Agner > Fog's results. I just prefer to speak by referring to facts that can not be > misconstrued as hearsay = ). That would be great. Also, can you point me to the Agner Fog table that you are referring to? Thanks.
2012 Jul 27
3
[LLVMdev] X86 FMA4
> It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. You are misunderstanding [no worries, happens to everyone = )]. The timings I listed were for vmovaps of the form vmovaps %xmm0, (mem), i.e., its 128-bit AVX form. Let me explain. There are 3 categories of instructions we...
2019 May 13
3
How shall I evaluate the latency of each instruction in LLVM IR?
Inspired by https://www.agner.org/optimize/instruction_tables.pdf, which gives the latency and reciprocal throughput of each instruction on the different x86 microarchitectures, has anybody taken on the effort of doing a similar job for LLVM IR? Thanks!
2018 Mar 15
5
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
...p. uop decomposition) of the instruction. The code snippet is jitted and executed on the host subtarget. The time taken (resp. resource usage) is measured using hardware performance counters. More details can be found in the ‘implementation’ section of the RFC. For people familiar with the work of Agner Fog, this is essentially an automation of the process of building the code snippets using instruction descriptions from LLVM. Results - Solving this bug <https://bugs.llvm.org/show_bug.cgi?id=36084> (sandybridge): > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency...
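For readers who have not done Agner-Fog-style measurements by hand, here is a rough sketch of the manual process the tool automates, timing a serially dependent chain of one instruction; this is illustrative only, since llvm-exegesis JITs the snippet and reads hardware performance counters rather than rdtsc:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc */

    int main(void) {
        const long iters = 100000000;
        uint64_t x = 1, start = __rdtsc();
        for (long i = 0; i < iters; ++i)
            /* each imul depends on the previous result, so the loop runs at
             * roughly the instruction's latency per iteration */
            __asm__ volatile("imul $3, %0, %0" : "+r"(x));
        uint64_t end = __rdtsc();
        printf("~%.2f reference cycles per imul\n",
               (double)(end - start) / iters);
        return (int)(x & 1);   /* keep the chain's result live */
    }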
2019 Aug 20
1
Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S
...er-register. "but should be fine" is not enough: XCHG is of course slow for register-register operations too, otherwise I would not have spent time writing in. See https://stackoverflow.com/questions/45766444/why-is-xchg-reg-reg-a-3-micro-op-instruction-on-modern-intel-architectures or Agner Fog's http://www.agner.org/optimize/instruction_tables.pdf > Remember, too, that klibc is optimized for size. Remember that the linker aligns functions on 16-byte boundaries! With XCHG, these functions have a code size of 29 bytes; with MOV they grow by 1 byte. >> PS: I doubt that a cur...
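For illustration (the klibc files in question are hand-written assembly, so this C sketch is only to make the register-register point concrete):

    /* xchg reg,reg is one instruction but 3 micro-ops on recent Intel cores,
     * per the Stack Overflow link above, while a mov-based swap is plain
     * single-uop register moves. */
    static inline void swap_xchg(unsigned *a, unsigned *b) {
        __asm__("xchg %0, %1" : "+r"(*a), "+r"(*b));
    }

    static inline void swap_mov(unsigned *a, unsigned *b) {
        unsigned t = *a;
        *a = *b;
        *b = t;
    }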
2016 Jan 21
2
Adding support for self-modifying branches to LLVM?
On 01/19/2016 09:04 PM, Sean Silva via llvm-dev wrote: > > AFAIK, the cost of a well-predicted, not-taken branch is the same as a > nop on every x86 made in the last many years. > See http://www.agner.org/optimize/instruction_tables.pdf > Generally speaking a correctly-predicted not-taken branch is basically > identical to a nop, and a correctly-predicted taken branch has an > extra overhead similar to an "add"...
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
...instruction. The code snippet is jitted and executed on the host > subtarget. The time taken (resp. resource usage) is measured using > hardware performance counters. More details can be found in the > ‘implementation’ section of the RFC. > > > For people familiar with the work of Agner Fog, this is essentially an > automation of the process of building the code snippets using > instruction descriptions from LLVM. > > > Results > > * > > Solving this bug > <https://bugs.llvm.org/show_bug.cgi?id=36084>(sandybridge): > > > llvm...
2018 Mar 15
3
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
...the instruction. The code snippet > is jitted and executed on the host subtarget. The time taken (resp. > resource usage) is measured using hardware performance counters. More > details can be found in the ‘implementation’ section of the RFC. > > For people familiar with the work of Agner Fog, this is essentially an > automation of the process of building the code snippets using instruction > descriptions from LLVM. > Results > > - > > Solving this bug <https://bugs.llvm.org/show_bug.cgi?id=36084> > (sandybridge): > > > llvm-exegesis -...
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
...truction. The code snippet is jitted and executed on the host > subtarget. The time taken (resp. resource usage) is measured using > hardware performance counters. More details can be found in the > ‘implementation’ section of the RFC. > > > For people familiar with the work of Agner Fog, this is essentially an > automation of the process of building the code snippets using > instruction descriptions from LLVM. > > > Results > > * > > Solving this bug > <https://bugs.llvm.org/show_bug.cgi?id=36084>(sandybridge): > > > ll...
2012 Sep 29
7
[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler
...of some benchmarks relative to LLVM's default scheduler by up to 21%. The geometric-mean speedup on FP2006 is about 2.4% across the entire suite. We then observed that LLVM's ILP scheduler on x86-64 uses "rough" latency values. So, we added the precise latency values published by Agner (http://www.agner.org/optimize/) and that led to more speedup relative to LLVM's ILP scheduler on some benchmarks. The most significant gain from adding precise latencies was on the gromacs benchmark, which has a high degree of ILP. I am attaching the benchmarking results on x86-64 using both...
2015 Jan 22
2
[LLVMdev] X86TargetLowering::LowerToBT
> On Jan 22, 2015, at 1:22 PM, Fiona Glaser <fglaser at apple.com> wrote: > > According to Agner’s docs, many CPUs have slower BT than TEST; Haswell has only 0.5 inverse throughput as opposed to 0.25, Atom has 1 instead of 0.5, and Silvermont can’t even dual-issue BT (it locks both ALUs). So while BT does seem to have a shorter instruction encoding than TEST for TEST reg, imm32 where imm32 has on...
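A minimal C sketch of the single-bit test this lowering is about (illustrative; whether the backend emits bt or a shift/test sequence is exactly the cost question raised in the thread):

    #include <stdbool.h>
    #include <stdint.h>

    /* (word >> n) & 1 with a variable bit index is the kind of pattern
     * X86TargetLowering::LowerToBT considers turning into BT reg,reg. */
    bool bit_is_set(uint32_t word, unsigned n) {
        return (word >> n) & 1u;
    }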
2012 Jul 27
2
[LLVMdev] X86 FMA4
Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps/etc. for loading/storing from memory. vmovaps load: 1 load µop, latency 3, reciprocal throughput 0.5. vmovaps store: 1 store µop plus 1 load µop for address calculation, latency 3, reciprocal throughput 1. He does not list vmovsd...
2012 Jul 27
0
[LLVMdev] X86 FMA4
Hey Michael, Thanks for the legwork! It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. As I am sure you are aware, we cannot use SSE (movaps) instructions in an AVX context, or else we'll pay the context switch penalty. It might be too big an assumption to assume that movaps and vmovaps have the same timings. Same for mov...
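A short intrinsics sketch of the mixing concern raised here (illustrative; compilers normally insert vzeroupper themselves when leaving AVX code, and the snippet assumes a build with -mavx):

    #include <immintrin.h>

    void scale(float *dst, const float *src, float k) {
        __m256 v = _mm256_loadu_ps(src);                             /* 256-bit AVX */
        _mm256_storeu_ps(dst, _mm256_mul_ps(v, _mm256_set1_ps(k)));
        _mm256_zeroupper();   /* so later legacy-SSE code avoids the
                                 SSE/AVX transition penalty mentioned above */
    }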
2012 Sep 29
0
[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler
...of some benchmarks relative to LLVM's default scheduler by up to 21%. The geometric-mean speedup on FP2006 is about 2.4% across the entire suite. We then observed that LLVM's ILP scheduler on x86-64 uses "rough" latency values. So, we added the precise latency values published by Agner (http://www.agner.org/optimize/) and that led to more speedup relative to LLVM's ILP scheduler on some benchmarks. The most significant gain from adding precise latencies was on the gromacs benchmark, which has a high degree of ILP. I am attaching the benchmarking results on x86-64 using both L...
2011 Apr 17
0
[LLVMdev] Macro-op fusion experiment
Hi Jacob, As far as I know, an x86 'mov' instruction always uses an ALU resource. According to Agner Fog's documents (http://www.agner.org/optimize/), it can execute on port 0, 1 or 5 on recent architectures though. So it's not that likely to be resource limited. But it still occupies an instruction slot throughout the entire pipeline, costing power and potentially limiting other actual ar...
2012 Nov 07
1
[LLVMdev] AVX broadcast Vs. vector constant pool load
...constant pool // into a vector. On Sandybridge it is still better to load a constant vector // from the constant pool and not to broadcast it from a scalar. Would anyone be able to explain why it is better to load a vector from the constant pool rather than broadcast a scalar? I checked out Agner Fog's tables, but it wasn't so obvious to me... vmovaps y, m256: Uops: 1 Lat: 4 Throughput: 1 vbroadcastsd y, m64: Uops: 2 Lat: [Not or cannot be measured] Throughput: 1 Thanks in advance, Cameron
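The two code-generation choices being weighed can be sketched with intrinsics (illustrative only; the actual decision is made in the X86 backend):

    #include <immintrin.h>

    __m256d splat_broadcast(const double *p) {
        return _mm256_broadcast_sd(p);   /* vbroadcastsd ymm, m64 */
    }

    __m256d splat_constant(void) {
        /* typically materialized as a full 256-bit constant-pool load
         * (vmovaps ymm, m256), the option the quoted comment prefers
         * on Sandy Bridge */
        return _mm256_set1_pd(2.0);
    }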
2014 Jan 03
1
PATCH: match calls and returns
According to Agner Fog, "...you must make sure that all calls are matched with returns. Never jump out of a subroutine without a return and never use a return as an indirect jump." (see paragraph 3.15 in microarchitecture.pdf and examples 3.5a and 3.5b in optimizing_assembly.pdf) Basically this patch repl...
2014 Jan 14
1
PATCH for lpc_asm.nasm
1) Two comments ";ASSERT(lp_quantization <= 31)" in the new functions ..._wide_asm_ia32() -- just to mention this constraint. (max. possible value of lp_quantization is 15, so it's not a problem) 2) "mov cl, ..." was replaced with "mov ecx, ..." (again Agner Fog, optimizing_assembly.pdf) Summary: a write to a partial register may result in false dependencies between instructions, so it is better to avoid it. (also bitreader_asm.nasm and stream_encoder_asm.nasm both have "mov ecx, ..." instructions, and no "mov cl, ...").
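A small illustration of the partial-register point (inline asm purely for demonstration; the actual change was in the hand-written NASM sources):

    /* Writing the full ecx before using cl as a shift count avoids the
     * false dependence on the old ecx contents that a "mov cl, ..." write
     * can create on some microarchitectures. */
    static inline unsigned shr_by(unsigned x, unsigned n) {
        __asm__("mov %1, %%ecx\n\t"   /* full 32-bit write, preferred      */
                /* "mov %b1, %%cl"       partial write: possible false dep */
                "shr %%cl, %0"
                : "+r"(x)
                : "r"(n & 31u)
                : "ecx", "cc");
        return x;
    }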