similar to: [LLVMdev] AVX broadcast Vs. vector constant pool load

Displaying 20 results from an estimated 1000 matches similar to: "[LLVMdev] AVX broadcast Vs. vector constant pool load"

2012 Jul 27
3
[LLVMdev] X86 FMA4
> It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. You are misunderstanding [no worries, happens to everyone = )]. The timings I listed were for
2012 Jul 27
0
[LLVMdev] X86 FMA4
Hey Michael, Thanks for the legwork! It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding. As I am sure you are aware, we cannot use SSE (movaps)
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote: > [You can find an easier to read and more complete version of this RFC > here > <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.] > > Knowing instruction scheduling properties (latency, uops) is the basis > for all scheduling work done by LLVM. > > >
2018 Mar 15
5
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
[You can find an easier to read and more complete version of this RFC here <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#> .] Knowing instruction scheduling properties (latency, uops) is the basis for all scheduling work done by LLVM. Unfortunately, vendors usually release only partial (and sometimes incorrect) information. Updating the
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
Sounds like a very useful tool.  Thank you for contributing. Taking a step back and looking at the big picture, combining this with the recently contributed llvm-mca dramatically improves our scheduling and performance analysis story.  Being able to take a snippet of code on a particular machine, measure latency/throughput/ports for each instruction (this tool), and then analyze the entire
2018 Mar 15
3
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev < llvm-dev at lists.llvm.org> wrote: > > On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote: > > [You can find an easier to read and more complete version of this RFC here > <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#> > .] > > Knowing
2012 Jul 27
2
[LLVMdev] X86 FMA4
Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps/etc for loading/storing from memory. vmovaps - load takes 1 load mu op, 3 latency, with a reciprocal throughput of 0.5. vmovaps - store takes 1 store mu op, 1 load mu op for address calculation, 3 latency, with a reciprocal throughput of 1. He does not list vmovsd, but movsd has the same stats as vmovaps, so I feel it is a
2013 Aug 28
3
[PATCH] x86: AVX instruction emulation fixes
- we used the C4/C5 (first prefix) byte instead of the apparent ModR/M one as the second prefix byte - early decoding normalized vex.reg, thus corrupting it for the main consumer (copy_REX_VEX()), resulting in #UD on the two-operand instructions we emulate Also add respective test cases to the testing utility plus - fix get_fpu() (the fall-through order was inverted) - add cpu_has_avx2,
2012 Jun 27
1
[PATCH] x86/hvm: increase struct hvm_vcpu_io's mmio_large_read
Since the emulator now supports a few 256-bit memory operations, this array needs to follow (and the comments should, too). To limit growth, re-order the mmio_large_write_* fields so that the two mmio_large_*_bytes fields end up adjacent to each other. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/include/asm-x86/hvm/vcpu.h +++ b/xen/include/asm-x86/hvm/vcpu.h @@ -59,13 +59,13
2018 Mar 15
0
[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
On 03/15/2018 10:49 AM, Clement Courbet wrote: > > > On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev > <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > > > On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote: >> [You can find an easier to read and more complete version of this >> RFC here >>
2011 Nov 30
0
[PATCH 2/4] x86/emulator: add emulation of SIMD FP moves
Clone the existing movq emulation to also support the most fundamental SIMD FP moves. Extend the testing code to also exercise these instructions. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/tools/tests/x86_emulator/test_x86_emulator.c +++ b/tools/tests/x86_emulator/test_x86_emulator.c @@ -629,6 +629,60 @@ int main(int argc, char **argv) else
2020 Jan 16
2
[llvm-exegesis]?==?utf-8?q? [RFC] Renaming Uops- classes
Since the option of running -mode=inverse_throughput was added to llvm-exegesis the names of classes like UopsSnippetGenerator and UopsBenchmarkRunner, that this mode shares with uops, started to be less descriptive. Inverse_throughput doesn't use the uops counters, so for example, the instruction layout shared between these two modes is really connected to parallelism, not uops. It's
2012 Jan 09
3
[LLVMdev] Calling conventions for YMM registers on AVX
On Jan 9, 2012, at 10:00 AM, Jakob Stoklund Olesen wrote: > > On Jan 8, 2012, at 11:18 PM, Demikhovsky, Elena wrote: > >> I'll explain what we see in the code. >> 1. The caller saves XMM registers across the call if needed (according to DEFS definition). >> YMMs are not in the set, so caller does not take care. > > This is not how the register allocator
2020 May 10
2
[llvm-mca] Resource consumption of ProcResGroups
> On May 9, 2020, at 5:12 PM, Andrea Di Biagio via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > The llvm scheduling model is quite simple and doesn't allow mca to accurately simulate the execution of individual uOPs. That limitation is sort-of acceptable if you consider how the scheduling model framework was originally designed with a different goal in mind (i.e. machine
2020 May 10
2
[llvm-mca] Resource consumption of ProcResGroups
Hi Alex, On Sun, May 10, 2020 at 4:00 PM Alex Renda <renda at csail.mit.edu> wrote: > Thanks, that’s very helpful! > > > > Also, sorry for the miscue on that bug with the 2/4 cycles — I realize now > that that’s an artifact of a change that I made to not crash when resource > groups overlap without all atomic subunits being specified: > > `echo 'fxrstor
2012 Jan 10
0
[LLVMdev] Calling conventions for YMM registers on AVX
This is the wrong code: declare <16 x float> @foo(<16 x float>) define <16 x float> @test(<16 x float> %x, <16 x float> %y) nounwind { entry: %x1 = fadd <16 x float> %x, %y %call = call <16 x float> @foo(<16 x float> %x1) nounwind %y1 = fsub <16 x float> %call, %y ret <16 x float> %y1 } ./llc -mattr=+avx
2014 Dec 22
2
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] > On Behalf Of Herbie Robinson > Subject: Re: [LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences > > On 12/21/14 4:27 AM, Kuperstein, Michael M wrote: > > Which performance guidelines are you referring to? > Table C-21 in "Intel(r) 64 and IA-32 Architectures
2019 Dec 17
2
[llvm-exegesis] Uops mode isnćt working
Hello, I've been testing llvm-exegesis on X86. Latency and inverse_throughput modes work fine but when I run uops I get an error: event not found - cannot create event uops_dispatched_port:port_0 LLVM ERROR: invalid perf event 'uops_dispatched_port:port_0' I'm running this on a i7-4790K. Am I missing something on my computer or is this not yet fully implemented? This also
2012 Jul 27
0
[LLVMdev] X86 FMA4
On Fri, Jul 27, 2012 at 2:37 PM, Michael Gottesman <mgottesman at apple.com> wrote: ... > I have actually timed said instructions in the past and reproduced Agner > Fog's results. I just prefer to speak by referring to facts that can not be > misconstrued as hearsay = ). That would be great. Also, can you point me to the Agner Fog table that you are referring to? Thanks.
2015 May 04
2
[LLVMdev] Load value and broadcast in LLVM
Hi Shahid, Thank you so much for your response. You suggested approach is what I am right now using. However, it seems that the overhead is a little bit high because we are introducing two more instructions. I was wondering if there was a cheaper way to do it. Best, Zhi On Mon, May 4, 2015 at 2:12 AM, Shahid, Asghar-ahmad < Asghar-ahmad.Shahid at amd.com> wrote: > Hi Zhi, > >