thr3ads.net - similar to: "PATCH: match calls and returns"

Displaying 20 results from an estimated 900 matches similar to: "PATCH: match calls and returns"

2014 Jan 14

PATCH for lpc_asm.nasm

1) Two comments ";ASSERT(lp_quantization <= 31)" in the new functions ..._wide_asm_ia32() -- just to mention this constraint. (max. possible value of lp_quantization is 15, so it's not a problem) 2) "mov cl, ..." was replaced with "mov ecx, ..." (again Agner Fog, optimizing_assembly.pdf) summary: write to a partial register may result in false dependencies

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote: > [You can find an easier to read and more complete version of this RFC > here > <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.] > > Knowing instruction scheduling properties (latency, uops) is the basis > for all scheduling work done by LLVM. > > >

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

Sounds like a very useful tool. Thank you for contributing. Taking a step back and looking at the big picture, combining this with the recently contributed llvm-mca dramatically improves our scheduling and performance analysis story. Being able to take a snippet of code on a particular machine, measure latency/throughput/ports for each instruction (this tool), and then analyze the entire

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev < llvm-dev at lists.llvm.org> wrote: > > On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote: > > [You can find an easier to read and more complete version of this RFC here > <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#> > .] > > Knowing

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

[You can find an easier to read and more complete version of this RFC here <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#> .] Knowing instruction scheduling properties (latency, uops) is the basis for all scheduling work done by LLVM. Unfortunately, vendors usually release only partial (and sometimes incorrect) information. Updating the

[LLVMdev] X86 FMA4

2012 Jul 27

[LLVMdev] X86 FMA4

On Fri, Jul 27, 2012 at 2:37 PM, Michael Gottesman <mgottesman at apple.com> wrote: ... > I have actually timed said instructions in the past and reproduced Agner > Fog's results. I just prefer to speak by referring to facts that can not be > misconstrued as hearsay = ). That would be great. Also, can you point me to the Agner Fog table that you are referring to? Thanks.

[LLVMdev] X86 FMA4

2012 Jul 27

[LLVMdev] X86 FMA4

> It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. You are misunderstanding [no worries, happens to everyone = )]. The timings I listed were for

[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences

2014 Dec 22

[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences

> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] > On Behalf Of Herbie Robinson > Subject: Re: [LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences > > On 12/21/14 4:27 AM, Kuperstein, Michael M wrote: > > Which performance guidelines are you referring to? > Table C-21 in "Intel(r) 64 and IA-32 Architectures

Why did Intel change his static branch prediction mechanism during these years?

2018 Aug 14

Why did Intel change his static branch prediction mechanism during these years?

( I don't know if it's allowed to ask such question, if not, please remind me. ) I know Intel implemented several static branch prediction mechanisms these years: * 80486 age: Always-not-take * Pentium4 age: Backwards Taken/Forwards Not-Taken * PM, Core2: Didn't use static prediction, randomly depending on what happens to be in corresponding BTB entry , according to agner's

patch

2004 Sep 10

patch

So here is quick patch solving the problem, now it should be PIC. -- Miroslav Lichvar lichvarm@phoenix.inf.upol.cz -------------- next part -------------- --- lpc_asm.nasm.orig Wed Jul 18 02:23:40 2001 +++ lpc_asm.nasm Sat Nov 17 21:09:46 2001 @@ -59,10 +59,10 @@ ; ALIGN 16 cident FLAC__lpc_compute_autocorrelation_asm_ia32 - ;[esp + 24] == autoc[] - ;[esp + 20] == lag - ;[esp + 16] ==

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

On 03/15/2018 10:49 AM, Clement Courbet wrote: > > > On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev > <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > > > On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote: >> [You can find an easier to read and more complete version of this >> RFC here >>

Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S

2019 Aug 20

Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S

"H. Peter Anvin" <hpa at zytor.com> wrote August 20, 2019 12:51 AM: > On 8/14/19 9:42 PM, Stefan Kanthak wrote: >> Hi, >> >> both >> https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__ashldi3.S >> and >> https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__lshrdi3.S

How shall I evaluate the latency of each instruction in LLVM IR?

2019 May 13

How shall I evaluate the latency of each instruction in LLVM IR?

Inspired by https://www.agner.org/optimize/instruction_tables.pdf, which gives us the latency and reciprocal throughput of each instruction in the different architecture of X86, Is there anybody taking the effort to do a similar job for LLVM IR? Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL:

Adding support for self-modifying branches to LLVM?

2016 Jan 21

Adding support for self-modifying branches to LLVM?

On 01/19/2016 09:04 PM, Sean Silva via llvm-dev wrote: > > AFAIK, the cost of a well-predicted, not-taken branch is the same as a > nop on every x86 made in the last many years. > See http://www.agner.org/optimize/instruction_tables.pdf > <http://www.agner.org/optimize/instruction_tables.pdf> > Generally speaking a correctly-predicted not-taken branch is basically >

PATCH: asm versions for two _wide() functions

2014 Jan 03

PATCH: asm versions for two _wide() functions

As I wrote earlier, GCC generates slow ia32 code for FLAC__lpc_compute_residual_from_qlp_coefficients_wide() and FLAC__lpc_restore_signal_wide(). So 24-bit encoding/decoding is slower for GCC compile than for MSVS or ICC compile. I took FLAC__lpc_compute_residual_from_qlp_coefficients_asm_ia32 and FLAC__lpc_restore_signal_asm_ia32 asm functions and wrote their _wide versions. -------------- next

[LLVMdev] X86 FMA4

2012 Jul 27

[LLVMdev] X86 FMA4

Hey Michael, Thanks for the legwork! It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. As I am sure you are aware, we cannot use SSE (movaps)

[LLVMdev] X86 FMA4

2012 Jul 27

[LLVMdev] X86 FMA4

Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps/etc for loading/storing from memory. vmovaps - load takes 1 load mu op, 3 latency, with a reciprocal throughput of 0.5. vmovaps - store takes 1 store mu op, 1 load mu op for address calculation, 3 latency, with a reciprocal throughput of 1. He does not list vmovsd, but movsd has the same stats as vmovaps, so I feel it is a

[LLVMdev] AVX broadcast Vs. vector constant pool load

2012 Nov 07

[LLVMdev] AVX broadcast Vs. vector constant pool load

Hey guys, I'm currently investigating broadcasts from the constant pool on Sandy Bridge. I see this comment in llvm/lib/Target/X86/X86ISelLowering.cpp: // Handle the broadcasting a single constant scalar from the constant pool // into a vector. On Sandybridge it is still better to load a constant vector // from the constant pool and not to broadcast it from a scalar. Would anyone

RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available

2017 Nov 03

RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available

On Thu, Nov 2, 2017 at 7:05 PM James Y Knight via llvm-dev < llvm-dev at lists.llvm.org> wrote: > On Wed, Nov 1, 2017 at 7:35 PM, Craig Topper via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> Hello all, >> >> >> >> I would like to propose adding the -mprefer-avx256 and -mprefer-avx128 >> command line flags supported by latest GCC to

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

2012 Sep 29

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

Hi, We are currently working on revising a journal article that describes our work on pre-allocation scheduling using LLVM and have some questions about LLVM's pre-allocation scheduler. The answers to these question will help us better document and analyze the results of our benchmark tests that compare our algorithm with LLVM's pre-allocation scheduling algorithm. First, here is a

similar to: PATCH: match calls and returns