Displaying 20 results from an estimated 72 matches for "agners".
2014 Dec 22
2
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
> On Behalf Of Herbie Robinson
> Subject: Re: [LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
> > On 12/21/14 4:27 AM, Kuperstein, Michael M wrote:
> > Which performance guidelines are you referring to?
> Table C-21 in "Intel® 64 and IA-32 Architectures
2018 Aug 14
4
Why did Intel change its static branch prediction mechanism over these years?
(I don't know if it's allowed to ask such a question; if not, please remind me.)
I know Intel has implemented several static branch prediction mechanisms
over the years:
* 80486 era: always not taken
* Pentium 4 era: backwards taken / forwards not taken
* Pentium M, Core 2: no static prediction; the outcome depends on
whatever happens to be in the corresponding BTB entry, according to Agner's
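For illustration, here is how the Pentium 4-era "backwards taken /
forwards not taken" heuristic interacts with a plain counted loop (a
minimal sketch; labels and registers are arbitrary):

        ; Under BTFNT, a branch with no BTB history is predicted taken
        ; if it jumps backwards and not taken if it jumps forwards.
        ; A loop's closing branch fits the heuristic naturally:
loop_top:
        ; ... loop body ...
        dec   ecx
        jnz   loop_top          ; backward branch: statically predicted taken
        ; fall-through on loop exit is the forward, not-taken direction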
2012 Jul 27
0
[LLVMdev] X86 FMA4
On Fri, Jul 27, 2012 at 2:37 PM, Michael Gottesman <mgottesman at apple.com> wrote:
...
> I have actually timed said instructions in the past and reproduced Agner
> Fog's results. I just prefer to refer to facts that cannot be
> misconstrued as hearsay = ).
That would be great. Also, can you point me to the Agner Fog table
that you are referring to? Thanks.
2012 Jul 27
3
[LLVMdev] X86 FMA4
> It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding.
You are misunderstanding [no worries, happens to everyone = )]. The timings I listed were for
2019 May 13
3
How shall I evaluate the latency of each instruction in LLVM IR?
Inspired by https://www.agner.org/optimize/instruction_tables.pdf, which
gives the latency and reciprocal throughput of each instruction on the
different x86 architectures: is there anybody taking on the effort to do a
similar job for LLVM IR?
Thanks!
2018 Mar 15
5
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
[You can find an easier to read and more complete version of this RFC here
<https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>
.]
Knowing instruction scheduling properties (latency, uops) is the basis for
all scheduling work done by LLVM.
Unfortunately, vendors usually release only partial (and sometimes
incorrect) information. Updating the
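The underlying measurement trick a tool like this relies on can be
sketched in a few lines of assembly (an illustration of the general
serial-dependency technique only, not llvm-exegesis's actual code):

        ; Chain N dependent copies of the instruction under test, so
        ; each must wait for the previous result; elapsed time / N
        ; then approximates the latency. (Real tools also serialize
        ; around the timestamp reads.)
        rdtsc                   ; timestamp before (EDX:EAX)
        mov   edi, eax
        add   ebx, ebx          ; N back-to-back dependent adds...
        add   ebx, ebx
        add   ebx, ebx          ; ...each depends on the one before
        rdtsc                   ; timestamp after
        sub   eax, edi          ; elapsed ticks ~= N * latency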
2019 Aug 20
1
Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S
"H. Peter Anvin" <hpa at zytor.com> wrote August 20, 2019 12:51 AM:
> On 8/14/19 9:42 PM, Stefan Kanthak wrote:
>> Hi,
>>
>> both
>> https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__ashldi3.S
>> and
>> https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__lshrdi3.S
2016 Jan 21
2
Adding support for self-modifying branches to LLVM?
On 01/19/2016 09:04 PM, Sean Silva via llvm-dev wrote:
>
> AFAIK, the cost of a well-predicted, not-taken branch is the same as a
> nop on every x86 made for many years now.
> See http://www.agner.org/optimize/instruction_tables.pdf
> Generally speaking a correctly-predicted not-taken branch is basically
>
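For concreteness, the kind of branch being discussed looks like this
(a sketch; the label is made up):

        ; A forward branch that is almost never taken and therefore
        ; almost always predicted correctly; on recent x86 the
        ; test+jnz pair can even macro-fuse into a single uop.
        test  eax, eax
        jnz   rare_path         ; hot path: predicted not taken, ~free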
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote:
> [You can find an easier to read and more complete version of this RFC
> here
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]
>
> Knowing instruction scheduling properties (latency, uops) is the basis
> for all scheduling work done by LLVM.
>
2018 Mar 15
3
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
> On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote:
>
> [You can find an easier to read and more complete version of this RFC here
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>
> .]
>
> Knowing
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
Sounds like a very useful tool. Thank you for contributing.
Taking a step back and looking at the big picture, combining this with
the recently contributed llvm-mca dramatically improves our scheduling
and performance analysis story. Being able to take a snippet of code on
a particular machine, measure latency/throughput/ports for each
instruction (this tool), and then analyze the entire
2012 Sep 29
7
[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler
Hi,
We are currently working on revising a journal article that describes
our work on pre-allocation scheduling using LLVM and have some questions about LLVM's pre-allocation scheduler. The answers to these questions will help us better document and analyze the results of our benchmark tests that compare our algorithm with LLVM's pre-allocation scheduling algorithm.
First, here is a
2015 Jan 22
2
[LLVMdev] X86TargetLowering::LowerToBT
> On Jan 22, 2015, at 1:22 PM, Fiona Glaser <fglaser at apple.com> wrote:
>
> According to Agner's docs, many CPUs have a slower BT than TEST; Haswell's BT has a reciprocal throughput of only 0.5 as opposed to 0.25 for TEST, Atom has 1 instead of 0.5, and Silvermont can't even dual-issue BT (it locks both ALUs). So while BT does seem to have a shorter instruction encoding than TEST for TEST reg, imm32 where
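For reference, the two ways of testing a single bit that are being
compared look like this (a sketch; bit 3 is arbitrary):

        bt    eax, 3            ; copies bit 3 into CF; shorter encoding,
        jc    bit_set           ; but slower on the CPUs listed above
        test  eax, 8            ; ANDs with mask 1<<3 and sets ZF; longer
        jnz   bit_set           ; encoding, but better throughput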
2012 Jul 27
2
[LLVMdev] X86 FMA4
Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps etc. when loading/storing from memory:
vmovaps load: 1 load µop, latency 3, reciprocal throughput 0.5.
vmovaps store: 1 store µop plus 1 load µop for address calculation, latency 3, reciprocal throughput 1.
He does not list vmovsd, but movsd has the same stats as vmovaps, so I feel it is a
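For reference, the instructions whose Sandy Bridge numbers are quoted
above (a sketch; registers and addresses are arbitrary, and vmovaps
requires 32-byte-aligned memory):

        vmovaps ymm0, [rsi]     ; 256-bit load:  1 load µop
        vmovaps [rdi], ymm0     ; 256-bit store: store µop + address µop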
2012 Jul 27
0
[LLVMdev] X86 FMA4
Hey Michael,
Thanks for the legwork!
It appears that the stats you listed are for movaps [SSE], not vmovaps
[AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256),
since they are both AVX instructions. Although, yes, I agree that this is
not clear from Agner's report. Please correct me if I am misunderstanding.
As I am sure you are aware, we cannot use SSE (movaps)
2012 Sep 29
0
[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler
On Sep 29, 2012, at 2:43 AM, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:
> Hi,
>
> > We are currently working on revising a journal article that describes our work on pre-allocation scheduling using LLVM and have some questions about LLVM's pre-allocation scheduler. The answers to these questions will help us better document and analyze the results of our benchmark
2011 Apr 17
0
[LLVMdev] Macro-op fusion experiment
Hi Jacob,
As far as I know, an x86 'mov' instruction always uses an ALU resource.
According to Agner Fog's documents (http://www.agner.org/optimize/), it can
execute on port 0, 1 or 5 on recent architectures though. So it's not that
likely to be resource limited. But it still occupies an instruction slot
throughout the entire pipeline, costing power and potentially limiting other
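A two-instruction sketch of the point (illustrative only):

        ; Even a plain register-to-register move issues to an ALU port
        ; (0, 1, or 5 on the architectures discussed), so it competes
        ; with real arithmetic for issue slots:
        mov   ebx, eax          ; occupies one of ports 0/1/5
        add   ecx, edx          ; may contend with the mov for a port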
2012 Nov 07
1
[LLVMdev] AVX broadcast Vs. vector constant pool load
Hey guys,
I'm currently investigating broadcasts from the constant pool on Sandy
Bridge. I see this comment in llvm/lib/Target/X86/X86ISelLowering.cpp:
// Handle the broadcasting a single constant scalar from the constant pool
// into a vector. On Sandybridge it is still better to load a constant vector
// from the constant pool and not to broadcast it from a scalar.
Would anyone
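The two code-generation alternatives being weighed look roughly like
this (a sketch; the constant-pool symbols are made up):

        ; Option A: broadcast one 4-byte scalar into all lanes
        vbroadcastss ymm0, [rel scalar_const]
        ; Option B: load the full 32-byte constant vector, which the
        ; comment above says is still better on Sandy Bridge
        vmovaps      ymm0, [rel vector_const]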
2014 Jan 03
1
PATCH: match calls and returns
According to Agner Fog, "...you must make sure that all calls
are matched with returns. Never jump out of a subroutine without
a return and never use a return as an indirect jump."
(see paragraph 3.15 in microarchitecture.pdf and
examples 3.5a and 3.5b in optimizing_assembly.pdf)
Basically this patch replaces
call .get_eip0
.get_eip0:
pop eax
with
call .mov_eip_to_eax
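The snippet is cut off here; presumably the call now targets a small
subroutine that ends in a matched ret, along these lines (a hedged
reconstruction for illustration, not the actual patch):

.mov_eip_to_eax:
        mov   eax, [esp]        ; copy the return address (EIP) into EAX
        ret                     ; the matched return keeps the CPU's
                                ; return-stack predictor balanced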
2014 Jan 14
1
PATCH for lpc_asm.nasm
1) Two comments ";ASSERT(lp_quantization <= 31)" in the new functions ..._wide_asm_ia32()
-- just to mention this constraint.
(the max. possible value of lp_quantization is 15, so it's not a problem)
2) "mov cl, ..." was replaced with "mov ecx, ..." (again per Agner Fog, optimizing_assembly.pdf)
summary: a write to a partial register may result in false dependencies
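A short illustration of the partial-register issue behind change 2 (a
sketch, not the patch itself):

        ; Writing CL leaves the rest of ECX live, so the CPU may treat
        ; the new value as depending on the old ECX (a false dependency):
        mov   cl, 5             ; partial write: merges with old ECX
        shl   eax, cl
        ; Writing the full register breaks the chain:
        mov   ecx, 5            ; full write: no merge, no false dep
        shl   eax, cl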