thr3ads.net - similar to: "[LLVMdev] AVX broadcast Vs. vector constant pool load"

Displaying 20 results from an estimated 1000 matches similar to: "[LLVMdev] AVX broadcast Vs. vector constant pool load"

[LLVMdev] X86 FMA4

2012 Jul 27

> It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. You are misunderstanding [no worries, happens to everyone = )]. The timings I listed were for

[LLVMdev] X86 FMA4

2012 Jul 27

Hey Michael, Thanks for the legwork! It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. As I am sure you are aware, we cannot use SSE (movaps)

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote: > [You can find an easier to read and more complete version of this RFC > here > <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.] > > Knowing instruction scheduling properties (latency, uops) is the basis > for all scheduling work done by LLVM. > > >

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

[You can find an easier to read and more complete version of this RFC here <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#> .] Knowing instruction scheduling properties (latency, uops) is the basis for all scheduling work done by LLVM. Unfortunately, vendors usually release only partial (and sometimes incorrect) information. Updating the

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

Sounds like a very useful tool. Thank you for contributing. Taking a step back and looking at the big picture, combining this with the recently contributed llvm-mca dramatically improves our scheduling and performance analysis story. Being able to take a snippet of code on a particular machine, measure latency/throughput/ports for each instruction (this tool), and then analyze the entire

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev < llvm-dev at lists.llvm.org> wrote: > > On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote: > > [You can find an easier to read and more complete version of this RFC here > <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#> > .] > > Knowing

[LLVMdev] X86 FMA4

2012 Jul 27

Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps/etc for loading/storing from memory. vmovaps - load takes 1 load mu op, 3 latency, with a reciprocal throughput of 0.5. vmovaps - store takes 1 store mu op, 1 load mu op for address calculation, 3 latency, with a reciprocal throughput of 1. He does not list vmovsd, but movsd has the same stats as vmovaps, so I feel it is a

[PATCH] x86: AVX instruction emulation fixes

2013 Aug 28

- we used the C4/C5 (first prefix) byte instead of the apparent ModR/M one as the second prefix byte - early decoding normalized vex.reg, thus corrupting it for the main consumer (copy_REX_VEX()), resulting in #UD on the two-operand instructions we emulate Also add respective test cases to the testing utility plus - fix get_fpu() (the fall-through order was inverted) - add cpu_has_avx2,

[PATCH] x86/hvm: increase struct hvm_vcpu_io's mmio_large_read

2012 Jun 27

Since the emulator now supports a few 256-bit memory operations, this array needs to follow (and the comments should, too). To limit growth, re-order the mmio_large_write_* fields so that the two mmio_large_*_bytes fields end up adjacent to each other. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/include/asm-x86/hvm/vcpu.h +++ b/xen/include/asm-x86/hvm/vcpu.h @@ -59,13 +59,13

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

On 03/15/2018 10:49 AM, Clement Courbet wrote: > > > On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev > <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > > > On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote: >> [You can find an easier to read and more complete version of this >> RFC here >>

[PATCH 2/4] x86/emulator: add emulation of SIMD FP moves

2011 Nov 30

Clone the existing movq emulation to also support the most fundamental SIMD FP moves. Extend the testing code to also exercise these instructions. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/tools/tests/x86_emulator/test_x86_emulator.c +++ b/tools/tests/x86_emulator/test_x86_emulator.c @@ -629,6 +629,60 @@ int main(int argc, char **argv) else

[llvm-exegesis] [RFC] Renaming Uops- classes

2020 Jan 16

Since the option of running -mode=inverse_throughput was added to llvm-exegesis the names of classes like UopsSnippetGenerator and UopsBenchmarkRunner, that this mode shares with uops, started to be less descriptive. Inverse_throughput doesn't use the uops counters, so for example, the instruction layout shared between these two modes is really connected to parallelism, not uops. It's

[LLVMdev] Calling conventions for YMM registers on AVX

2012 Jan 09

On Jan 9, 2012, at 10:00 AM, Jakob Stoklund Olesen wrote: > > On Jan 8, 2012, at 11:18 PM, Demikhovsky, Elena wrote: > >> I'll explain what we see in the code. >> 1. The caller saves XMM registers across the call if needed (according to DEFS definition). >> YMMs are not in the set, so caller does not take care. > > This is not how the register allocator

[llvm-mca] Resource consumption of ProcResGroups

2020 May 10

> On May 9, 2020, at 5:12 PM, Andrea Di Biagio via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > The llvm scheduling model is quite simple and doesn't allow mca to accurately simulate the execution of individual uOPs. That limitation is sort-of acceptable if you consider how the scheduling model framework was originally designed with a different goal in mind (i.e. machine

[llvm-mca] Resource consumption of ProcResGroups

2020 May 10

Hi Alex, On Sun, May 10, 2020 at 4:00 PM Alex Renda <renda at csail.mit.edu> wrote: > Thanks, that’s very helpful! > > > > Also, sorry for the miscue on that bug with the 2/4 cycles — I realize now > that that’s an artifact of a change that I made to not crash when resource > groups overlap without all atomic subunits being specified: > > `echo 'fxrstor

[LLVMdev] Calling conventions for YMM registers on AVX

2012 Jan 10

This is the wrong code: declare <16 x float> @foo(<16 x float>) define <16 x float> @test(<16 x float> %x, <16 x float> %y) nounwind { entry: %x1 = fadd <16 x float> %x, %y %call = call <16 x float> @foo(<16 x float> %x1) nounwind %y1 = fsub <16 x float> %call, %y ret <16 x float> %y1 } ./llc -mattr=+avx

[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences

2014 Dec 22

> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] > On Behalf Of Herbie Robinson > Subject: Re: [LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences > > On 12/21/14 4:27 AM, Kuperstein, Michael M wrote: > > Which performance guidelines are you referring to? > Table C-21 in "Intel(r) 64 and IA-32 Architectures

[llvm-exegesis] Uops mode isn't working

2019 Dec 17

Hello, I've been testing llvm-exegesis on X86. Latency and inverse_throughput modes work fine but when I run uops I get an error: event not found - cannot create event uops_dispatched_port:port_0 LLVM ERROR: invalid perf event 'uops_dispatched_port:port_0' I'm running this on a i7-4790K. Am I missing something on my computer or is this not yet fully implemented? This also

[LLVMdev] X86 FMA4

2012 Jul 27

On Fri, Jul 27, 2012 at 2:37 PM, Michael Gottesman <mgottesman at apple.com> wrote: ... > I have actually timed said instructions in the past and reproduced Agner > Fog's results. I just prefer to speak by referring to facts that can not be > misconstrued as hearsay = ). That would be great. Also, can you point me to the Agner Fog table that you are referring to? Thanks.

[LLVMdev] Load value and broadcast in LLVM

2015 May 04

Hi Shahid, Thank you so much for your response. Your suggested approach is what I am right now using. However, it seems that the overhead is a little bit high because we are introducing two more instructions. I was wondering if there was a cheaper way to do it. Best, Zhi On Mon, May 4, 2015 at 2:12 AM, Shahid, Asghar-ahmad < Asghar-ahmad.Shahid at amd.com> wrote: > Hi Zhi, > >
