thr3ads.net - search: "microarchitectur"

Displaying 20 results from an estimated 156 matches for "microarchitectur".

(RFC) Adjusting default loop fully unroll threshold

2017 Feb 13

...< llvm-dev at lists.llvm.org> wrote:
> For unrolling specifically I agree with Hal that the hooks should be
> target specific. Actually, I go further and think they should be uArch
> specific.

They already are, it is just that no one has contributed a patch to use this on x86 microarchitectures. Until someone shows up with data showing that we need different tunings for different microarchitectures, it doesn't make sense for us to just make up numbers there. On the (very limited) microarchitectures we have and can test on, we're not seeing a need for microarchitectural tuning....

[PATCH][XENOPROFILE] add support for Intel CORE microarchitecture

2006 Oct 02

This adds support for core and core2 chips. Tested on Woodcrest processors. Requires Oprofile 0.9.2. -Andrew Signed-off-by: Andrew Theurer <habanero@us.ibm.com>

Load combine pass

2016 Sep 29

...the address sanitiser. Our architecture supports byte-granularity bounds checking in hardware. Note that even without this, for pure MIPS code without our extensions, load widening generates significantly worse code than when it doesn’t happen. I’m actually finding it difficult to come up with a microarchitecture where a 16-bit load followed by an 8-bit load from the same cache line would give worse performance than a 32-bit load, a mask and a shift. In an in-order design, it’s more instructions to do the same work, and therefore slower. In an out-of-order design, the two loads within the cache line will...
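The two code shapes compared in this snippet can be sketched in C as follows. This is only an illustration: the type `fields_t` and the functions `load_narrow` and `load_widened` are hypothetical names, and the 32-bit load is modelled as a little-endian byte composition so the result does not depend on host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t half; /* 16-bit field at offset 0 */
    uint8_t byte;  /* 8-bit field at offset 2 */
} fields_t;

/* Narrow form: a 16-bit load followed by an 8-bit load.
 * Only bytes 0..2 of the buffer are read. */
fields_t load_narrow(const uint8_t *p) {
    assert(p != NULL);
    fields_t f;
    f.half = (uint16_t)(p[0] | (p[1] << 8));
    f.byte = p[2];
    return f;
}

/* Widened form: one 32-bit load, then a mask and a shift.
 * Bytes 0..3 are read, so the buffer must hold at least 4 bytes. */
fields_t load_widened(const uint8_t *p) {
    assert(p != NULL);
    uint32_t w = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    fields_t f;
    f.half = (uint16_t)(w & 0xFFFFu);       /* mask out the low half */
    f.byte = (uint8_t)((w >> 16) & 0xFFu);  /* shift, then mask the byte */
    return f;
}
```

Both functions return the same fields, but the widened form reads one byte more than the program asked for, which is the kind of access that byte-granularity bounds checking would observe.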

[AArch64] Address computation folding

2015 Nov 11

Hi, I was looking at some AArch64 benchmarks and noticed some simple cases where addresses are being folded into the address mode computations and was curious as to why. In particular, consider the following simple example:

    void f2(unsigned long *x, unsigned long c) { x[c] *= 2; }

This generates:

    lsl x8, x1, #3
    ldr x9, [x0, x8]
    lsl x9, x9, #1
    str x9, [x0, x8]

Given the two

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

...nstructions have different implementations depending on which registers are assigned. This is well known for cases like `xor eax, eax` and `xor eax, ebx`, which emits no uops in the first case (this happens during register renaming, see Agner Fog’s “Register Allocation and Renaming”, in microarchitecture.pdf <http://www.agner.org/optimize/microarchitecture.pdf>). But we found out that this can go further. For example, SHLD64rri8 takes one cycle and runs on P06 in the `shld rax, rax, 0x1` case, but takes 3 cycles and runs on P1 in the `shld rbx, rax, 0x1` case. To the best of our...

[LLVMdev] SchedMachineModel clarifications

2013 Nov 13

Dear Andrew and the Group, I’m trying to come up with a SchedMachineModel for the AMD Bulldozer http://en.wikipedia.org/wiki/Bulldozer_(microarchitecture). No model exists for it yet. Please correct me if I am wrong here. I was going through your reference @ https://llvm.org/svn/llvm-project/llvm/trunk/include/llvm/Target/TargetSchedule.td . But I couldn’t model some of your definitions in the reference, like a) Subtarget...

[LLVMdev] Generating movq2dq using IRBuilder

2008 Jul 31

...ors, MMX instructions are often emitted even when SSE3 is available. Is this really the intent or is it just that SSE versions of certain patterns have not been added, and therefore it falls back to MMX versions? It's not really encouraged to use MMX (or x87 for that matter) on modern microarchitectures if you can get away with SSE. -- Stefanus Du Toit <stefanus.dutoit at rapidmind.com> RapidMind Inc. phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463

[AArch64] Address computation folding

2015 Nov 11

...ldn't it consider the number of uses in any operation? The
> > "expected" code is easy to get by checking the number of uses. This
> > may be desirable on some micro-architectures depending on the cost of
> > the various loads and stores.
>
> As you say, very microarchitecture-dependent. The code produced is
> probably optimal for Cyclone ("[x0, x8]" is no more expensive than
> "[x8]" and the "lsl" is slightly cheaper than the complicated "add").
> If I'm reading the Cortex-A57 optimisation guide correctly, the same...

Why did Intel change his static branch prediction mechanism during these years?

2018 Aug 14

...el the best static mechanism for Intel should be to clearly document his CPU "where I plan to go when dynamic predictor failed, forward or backward", because usually the programmer is the best guide at that time. APPENDIX: ¹ Agner's optimization guide: https://www.agner.org/optimize/microarchitecture.pdf , section 3.5 . ² Matt G's experiment: https://xania.org/201602/bpu-part-two

RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available

2017 Nov 01

...bit operations on some Intel CPUs may cause a decrease in CPU frequency that may offset the gains from using the wider register size. See section 15.26 of Intel® 64 and IA-32 Architectures Optimization Reference Manual published October 2017.
-The vector ALUs on ports 0 and 1 of the Skylake Server microarchitecture are only 256 bits wide. 512-bit instructions using these ALUs must use both ports. See section 2.1 of Intel® 64 and IA-32 Architectures Optimization Reference Manual published October 2017.
Implementation Plan:
-Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in X86.td not mapped to a...

Pattern transformation between scalar and vector on IR.

2016 Sep 08

Hi All, I'm trying to use RSQRT instructions in the following case for ARM (sqrt is what is used now): 1.0 / sqrt(x). The RSQRT instructions (VRSQRTE/VRSQRTS) are vector instructions, but the operation above is scalar, so a transformation must be done (transform the sqrt pattern to rsqrt). I have completed a patch for this, but I made the transformation in the backend, which will lead to additional
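For readers unfamiliar with the estimate-and-step pair, the arithmetic can be sketched in scalar C. This is a sketch under stated assumptions, not the poster's patch: VRSQRTS computes (3 - a*b) / 2, which is one Newton-Raphson step for 1/sqrt(d) when called with a = d*x and b = x. Plain C has no access to the hardware estimate, so `rsqrt_estimate` below is a stand-in that uses a well-known bit-level approximation; all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the hardware estimate (VRSQRTE). Assumption: a bit-level
 * approximation replaces the instruction, since it is not reachable from
 * portable C. Requires IEEE-754 single-precision floats. */
static float rsqrt_estimate(float d) {
    uint32_t bits;
    float x;
    memcpy(&bits, &d, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Same arithmetic as the VRSQRTS step: (3 - a * b) / 2. */
static float rsqrt_step(float a, float b) {
    return (3.0f - a * b) * 0.5f;
}

/* 1 / sqrt(d) via one estimate and two refinement steps. */
float approx_rsqrt(float d) {
    assert(d > 0.0f);
    float x = rsqrt_estimate(d);
    x = x * rsqrt_step(d * x, x); /* x * (3 - d*x*x) / 2 */
    x = x * rsqrt_step(d * x, x);
    return x;
}
```

The number of refinement steps trades accuracy for instruction count, which is one reason such a rewrite is usually only applied when reduced precision is acceptable.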

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

...rent implementations depending on which
> registers are assigned. This is well known for cases like `xor
> eax, eax` and `xor eax, ebx`, which emits no uops in the first case
> (this happens during register renaming, see Agner Fog’s “Register
> Allocation and Renaming”, in microarchitecture.pdf
> <http://www.agner.org/optimize/microarchitecture.pdf>). But we
> found out that this can go further. For example, SHLD64rri8 takes
> one cycle and runs on P06 in the `shld rax, rax, 0x1` case, but
> takes 3 cycles and runs on P1 in the `shld rbx, rax, 0x1` case...

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

...ferent implementations depending on which
> registers are assigned. This is well known for cases like `xor eax,
> eax` and `xor eax, ebx`, which emits no uops in the first case (this
> happens during register renaming, see Agner Fog’s “Register Allocation and
> Renaming”, in microarchitecture.pdf
> <http://www.agner.org/optimize/microarchitecture.pdf>). But we found
> out that this can go further. For example, SHLD64rri8 takes one cycle
> and runs on P06 in the `shld rax, rax, 0x1` case, but takes 3 cycles
> and runs on P1 in the `shld rbx, rax, 0x1` case....

[LLVMdev] Instruction MVT::ValueTypes

2008 Sep 03

...xample, find code like this in X86InstrSSE.td:

    def : Pat<(alignedloadv2i64 addr:$src), (MOVAPSrm addr:$src)>, Requires<[HasSSE2]>;
    def : Pat<(loadv2i64 addr:$src), (MOVUPSrm addr:$src)>, Requires<[HasSSE2]>;

and change it to not select MOVAPS for that microarchitecture, for example.

Dan

[LLVMdev] oprofile support?

2014 Oct 17

...time is 1413559198/1413559208
opjitconv: Ending with rc = 2. This code is usually OK, but can be useful for debugging purposes.
JIT dump processing complete.
operf-read process returned OK
Profiling done.
$ opreport
Using /home/dad/oprofile_data/samples/ for samples directory.
CPU: Intel Haswell microarchitecture, speed 3498 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 5000000
CPU_CLK_UNHALT...|
samples| %|
------------------
6949 100.000 lli
CPU_CLK_UNHALT...|
samples| %|
------------------...

[LLVMdev] SchedMachineModel clarifications

2013 Nov 21

...comments on the implementation. Thanks ~umesh

On Wed, Nov 13, 2013 at 8:14 PM, Umesh Kalappa <umesh.kalappa0 at gmail.com> wrote:
> Dear Andrew and the Group,
>
> I’m trying to come up with a SchedMachineModel for the AMD Bulldozer http://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).
>
> No model exists for it yet. Please correct me if I am wrong here.
>
> I was going through your reference @
> https://llvm.org/svn/llvm-project/llvm/trunk/include/llvm/Target/TargetSchedule.td .
>
> But I couldn’t model some...

[RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

2018 Mar 15

[LLVMdev] Static Profiling Algorithms in LLVM

2010 Nov 02

...robabilities. The implementation also covers an intraprocedural and interprocedural frequency calculator for edges and functions.

Reference: Youfeng Wu and James R. Larus. Static branch frequency and program profile analysis. In MICRO 27: Proceedings of the 27th annual international symposium on Microarchitecture. IEEE, 1994.

Regards, Andrei

On Tue, Nov 2, 2010 at 10:46 AM, Andrew Lenharth <andrewl at lenharth.org> wrote:
> On Tue, Nov 2, 2010 at 12:28 AM, kapil anand <kapilanand2 at gmail.com> wrote:
>> Hi all,
>>
>> Does LLVM infrastructure contain implementation...

[compiler-rt] Improve atomic locking?

2016 Dec 29

Hey, I am wondering whether there is room for improving the locking of a pointer when an atomic operation is performed. I've noticed that one could increase the SPINLOCK_COUNT in lib/builtins/atomic.c to (1 << 13), which is an 8x increase in available locks, if we also change the type of the atomic lock, which currently is uintptr_t, to a single byte (uint8_t), which I
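The idea of a larger table of one-byte locks can be sketched with C11 atomics. This is a minimal sketch, not the code in lib/builtins/atomic.c: the hash in `lock_index` and the helper names `try_lock_for` and `unlock_for` are hypothetical, and only SPINLOCK_COUNT = (1 << 13) and the uint8_t lock type come from the post.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SPINLOCK_COUNT (1u << 13)
#define SPINLOCK_MASK (SPINLOCK_COUNT - 1u)

/* One byte per lock; static storage is zero-initialised, so all start unlocked. */
static _Atomic uint8_t locks[SPINLOCK_COUNT];

/* Hypothetical hash: drop the low alignment bits, fold in higher bits,
 * then mask down to a table index. */
size_t lock_index(const void *ptr) {
    uintptr_t hash = (uintptr_t)ptr;
    hash >>= 4;
    hash ^= hash >> 13;
    return (size_t)(hash & SPINLOCK_MASK);
}

/* Returns true if the lock guarding ptr was free and is now held. */
bool try_lock_for(const void *ptr) {
    assert(ptr != NULL);
    return atomic_exchange_explicit(&locks[lock_index(ptr)], 1,
                                    memory_order_acquire) == 0;
}

/* Releases the lock guarding ptr. */
void unlock_for(const void *ptr) {
    assert(ptr != NULL);
    atomic_store_explicit(&locks[lock_index(ptr)], 0, memory_order_release);
}
```

A blocking acquire would spin on `try_lock_for` until it returns true. One trade-off worth measuring is that byte-sized locks pack many entries into a single cache line, so unrelated pointers may contend on the same line even when they map to different locks.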

_ExtInt, LLVM integers and constant time

2020 Apr 22

> On Apr 22, 2020, at 12:24 AM, Roman Lebedev via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On Wed, Apr 22, 2020 at 9:35 AM Adrien Guinet via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hello everyone,
>>
>> After reading the nice blog post about _ExtInt, I was wondering whether
>>
