Displaying 20 results from an estimated 112 matches for "icach".
2010 Nov 30
0
[LLVMdev] LLVM Inliner
On Nov 30, 2010, at 2:19 PM, Xinliang David Li wrote: > I understand that, but that implies that you have some model for code locality. Setting a global code growth limit is (in my opinion) a hack unless you are aiming for the whole program to fit in the icache (which I don't think anyone tries to do :). > > Yes, global growth limit may be good for size control, but is a hack for control icache footprint. However, as I mentioned, the bottom up inline scheme make it impossible to use any heuristics involving 'global limit' which can be...
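For illustration, here is a minimal sketch of the scheme being debated: a bottom-up inliner that charges every inlined callee against a single global code-growth budget. All names are hypothetical, not LLVM's actual inliner API, and the sketch deliberately exhibits the limitation raised in the thread.

```cpp
// Hypothetical sketch (not LLVM's real inliner): a bottom-up inline
// pass that charges each inlined callee against one global growth
// budget -- the "hack" being debated above.
#include <cstddef>

struct CallSite { std::size_t CalleeSize; int LocalBenefit; };

class BottomUpInliner {
  std::size_t GrowthBudget;    // global limit on total size growth
  std::size_t GrowthSoFar = 0;

public:
  explicit BottomUpInliner(std::size_t Budget) : GrowthBudget(Budget) {}

  // Visits call sites leaf-first; a site is inlined only if its local
  // benefit is positive AND the global budget is not yet exhausted.
  // The limitation discussed in the thread: because sites are visited
  // bottom-up rather than in priority order, a high-benefit site seen
  // late can lose the budget to low-benefit sites seen early.
  bool shouldInline(const CallSite &CS) {
    if (CS.LocalBenefit <= 0)
      return false;
    if (GrowthSoFar + CS.CalleeSize > GrowthBudget)
      return false;            // global limit kicks in
    GrowthSoFar += CS.CalleeSize;
    return true;
  }
};
```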
2010 Nov 30
3
[LLVMdev] LLVM Inliner
...n't scale for the compiler workload -- i.e. get 2x speedup using 8 core. > > I understand that, but that implies that you have some model for code > locality. Setting a global code growth limit is (in my opinion) a hack > unless you are aiming for the whole program to fit in the icache (which I > don't think anyone tries to do :). > > Yes, global growth limit may be good for size control, but is a hack for control icache footprint. However, as I mentioned, the bottom up inline scheme make it impossible to use any heuristics involving 'global limit' which ca...
2010 Nov 30
2
[LLVMdev] LLVM Inliner
...wrote: > > On Nov 30, 2010, at 2:19 PM, Xinliang David Li wrote: > > I understand that, but that implies that you have some model for code >> locality. Setting a global code growth limit is (in my opinion) a hack >> unless you are aiming for the whole program to fit in the icache (which I >> don't think anyone tries to do :). >> > > Yes, global growth limit may be good for size control, but is a hack for > control icache footprint. However, as I mentioned, the bottom up inline > scheme make it impossible to use any heuristics involving 'glob...
2010 Nov 30
0
[LLVMdev] LLVM Inliner
On Nov 30, 2010, at 2:36 PM, Xinliang David Li wrote: >> Yes, global growth limit may be good for size control, but is a hack for control icache footprint. However, as I mentioned, the bottom up inline scheme make it impossible to use any heuristics involving 'global limit' which can be more complicated and fancier than the simple growth limit. For instance, there is no restriction that only one global limit can be used --- the c...
2010 Nov 29
0
[LLVMdev] LLVM Inliner
...as a hack to prevent run-away inlining) so not visiting in priority order shouldn't prevent high-priority-but-processed-late candidates from being inlined. > > global threshold can be used to control the unnecessary size growth. In some cases, the size increase may also cause increase in icache footprint leading to poor performance. In fact, with IPO/CMO, icache footprint can be modeled in some way and be used as one kind of global limit. I understand that, but that implies that you have some model for code locality. Setting a global code growth limit is (in my opinion) a hack unless...
2015 May 06
2
[PATCH 0/6] x86: reduce paravirtualized spinlock overhead
...dropped from >> about 600 to 500 cycles. >> >> spin_unlock() for first time dropped from 145 to 87 cycles. >> >> spin_lock() in a loop dropped from 48 to 45 cycles. >> >> spin_unlock() in the same loop dropped from 24 to 22 cycles. > > Did you isolate icache hot/cold from dcache hot/cold? It seems to me the > main difference will be whether the branch predictor is warmed up rather > than if the lock itself is in dcache, but its much more likely that the > lock code is icache if the code is lock intensive, making the cold case > moot. But t...
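A rough sketch of the kind of measurement under discussion, separating the "first time" (cold icache, cold lock line, untrained branch predictor) from the "in a loop" case. This assumes x86 and GCC/Clang's `<x86intrin.h>`; proper serialization (e.g. cpuid fencing) is omitted, so treat the numbers as approximate.

```cpp
// Cold vs. warm spinlock timing sketch (x86, GCC/Clang).
#include <x86intrin.h>
#include <atomic>
#include <cstdio>

static std::atomic_flag lock = ATOMIC_FLAG_INIT;

static inline void spin_lock()   { while (lock.test_and_set(std::memory_order_acquire)) {} }
static inline void spin_unlock() { lock.clear(std::memory_order_release); }

int main() {
  // "First time": code and lock cacheline are cold, predictor untrained.
  unsigned long long t0 = __rdtsc();
  spin_lock();
  unsigned long long t1 = __rdtsc();
  spin_unlock();
  unsigned long long t2 = __rdtsc();
  std::printf("cold lock: %llu cycles, cold unlock: %llu cycles\n",
              t1 - t0, t2 - t1);

  // "In a loop": icache, dcache and branch predictor are all warm,
  // so this mostly measures the warmed-up fast path.
  const int N = 100000;
  unsigned long long t3 = __rdtsc();
  for (int i = 0; i < N; ++i) { spin_lock(); spin_unlock(); }
  unsigned long long t4 = __rdtsc();
  std::printf("warm lock+unlock pair: %llu cycles\n", (t4 - t3) / N);
}
```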
2014 Oct 24
3
[PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
...NULL; >> + pv_init_node(node); >> >> /* >> * We touched a (possibly) cold cacheline in the per-cpu queue node; > > So even if !pv_enabled() the compiler will still have to emit the code > for that inline, which will generate additional register pressure, > icache pressure and lovely stuff like that. > > The patch I had used pv-ops for these things that would turn into NOPs > in the regular case and callee-saved function calls for the PV case. > > That still does not entirely eliminate cost, but does reduce it > significant. Please conside...
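The trade-off described here can be sketched in plain C++. The kernel's real mechanism is pv-ops with runtime patching, which rewrites the call site itself into NOPs; the names below are hypothetical stand-ins.

```cpp
// Sketch of the trade-off: always-inline hook vs. patchable hook.
struct QNode { int locked; QNode *next; };

bool pv_enabled_flag = false;
void pv_init_node(QNode *n) { n->locked = 0; n->next = nullptr; }

// Variant A: the hook is emitted inline at every call site. Even when
// PV is disabled, the compiler must materialize the test and the call,
// costing registers and icache in the common (non-PV) path.
void queue_lock_inline(QNode *n) {
  if (pv_enabled_flag)       // test emitted in every caller
    pv_init_node(n);
  // ... rest of the slow path ...
}

// Variant B: an indirect hook that is a no-op by default and replaced
// once at boot -- a rough stand-in for pv-ops patching, which goes
// further and turns the call site itself into NOPs in the regular case.
void pv_init_node_nop(QNode *) {}
void (*pv_init_node_op)(QNode *) = pv_init_node_nop;

void queue_lock_patched(QNode *n) {
  pv_init_node_op(n);        // callee-saved call in the PV case
  // ... rest of the slow path ...
}
```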
2015 May 04
2
[PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On 04/30/2015 06:39 PM, Jeremy Fitzhardinge wrote: > On 04/30/2015 03:53 AM, Juergen Gross wrote: >> Paravirtualized spinlocks produce some overhead even if the kernel is >> running on bare metal. The main reason are the more complex locking >> and unlocking functions. Especially unlocking is no longer just one >> instruction but so complex that it is no longer inlined.
2010 Nov 29
3
[LLVMdev] LLVM Inliner
...hack to prevent run-away inlining) so not visiting > in priority order shouldn't prevent high-priority-but-processed-late > candidates from being inlined. > global threshold can be used to control the unnecessary size growth. In some cases, the size increase may also cause increase in icache footprint leading to poor performance. In fact, with IPO/CMO, icache footprint can be modeled in some way and be used as one kind of global limit. > > The only potential issue I'm aware of is if we have A->B->C and we decide > to inline C into B when it would be more profitabl...
2016 Mar 10
2
[RFC] Target-specific parametrization of function inliner
...of call/return pairs to help branch prediction of ret instructions -- such stack has a target specific limit which can be triggered when a callsite is deep in the callchain. Register file size and register pressure increase due to inline comes as another example. Another relevant example is the icache/itlb sizes. To do a more precise analysis of the cost to 'speed' due to icache/itlb pressure increase requires target information, profile information as well as some global analysis. Easwaran has done some research in this area in the past and can share the analysis design when other thin...
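A speculative sketch of what such target parametrization could look like; these names and numbers are illustrative only, not LLVM's actual interfaces, and the weights stand in for the profile-driven analysis the RFC alludes to.

```cpp
// Hypothetical target-specific inline cost adjustment.
#include <cstdint>

struct TargetInlineParams {
  unsigned ReturnStackDepth;   // HW return-address-stack entries
  unsigned NumAllocatableRegs; // proxy for register-pressure headroom
  uint64_t ICacheBytes;        // target icache size
  uint64_t ITLBPages;          // target itlb reach
};

int inlineCostAdjustment(const TargetInlineParams &TP,
                         unsigned CallChainDepth,
                         uint64_t EstimatedHotCodeBytes) {
  int Adjust = 0;
  // Beyond the HW return stack's depth, ret prediction starts missing;
  // inlining a callsite that deep removes a call/return pair, so give
  // it a bonus (negative cost).
  if (CallChainDepth >= TP.ReturnStackDepth)
    Adjust -= 25;
  // If the hot working set already approaches icache capacity, further
  // code growth likely costs more than the saved call overhead.
  if (EstimatedHotCodeBytes > TP.ICacheBytes)
    Adjust += 50;
  return Adjust;
}
```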
2017 Jan 30
4
(RFC) Adjusting default loop fully unroll threshold
Currently, loop fully unroller shares the same default threshold as loop dynamic unroller and partial unroller. This seems conservative because unlike dynamic/partial unrolling, fully unrolling will not affect LSD/ICache performance. In https://reviews.llvm.org/D28368, I proposed to double the threshold for loop fully unroller. This will change the codegen of several SPECCPU benchmarks: Code size: 447.dealII 0.50% 453.povray 0.42% 433.milc 0.20% 445.gobmk 0.32% 403.gcc 0.05% 464.h264ref 3.62% Compile Time: 447.d...
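To illustrate the distinction the RFC draws: full unrolling removes the loop entirely, whereas partial/dynamic unrolling keeps a (larger) loop body that still has to fit hardware loop buffers such as the LSD.

```cpp
// What the fully unroller does to a constant trip-count loop.
int sum_before(const int *a) {
  int s = 0;
  for (int i = 0; i < 4; ++i)  // trip count known at compile time
    s += a[i];
  return s;
}

// After full unrolling there is no backedge left at all: the body is
// straight-line code, so LSD capacity is not a concern for it.
int sum_after(const int *a) {
  int s = 0;
  s += a[0];
  s += a[1];
  s += a[2];
  s += a[3];
  return s;
}
```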
2011 Jun 16
2
[LLVMdev] LLVM-based address sanity checker
...it to 1 byte (at least on x86/x86_64) with some more work. http://code.google.com/p/address-sanitizer/wiki/AddressSanitizerAlgorithm#Report_Error My first attempt that used no asm required ~15 bytes of code. Note, this code is executed only once, so it affects the performance very slightly (through icache size). > > The run-time library being 1.5k loc is not encouraging, but it didn't > look particularly platform specific... > Alas. It will grow even more when we add MacOS support. (currently, only tiny tests work on Mac). --kcc > > cheers, > --renato > ------------...
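For context, a sketch of the documented AddressSanitizer check (per the algorithm wiki linked above); the shadow offset below is an example value, as the real one is platform-specific. The "report error" code discussed in the thread is the slow path taken when the shadow byte is nonzero.

```cpp
// AddressSanitizer-style shadow check, simplified.
#include <cstdint>
#include <cstdio>
#include <cstdlib>

constexpr uintptr_t kShadowOffset = 0x7fff8000; // example value only
constexpr unsigned kShadowScale = 3;            // 8 app bytes -> 1 shadow byte

[[noreturn]] void ReportError(uintptr_t Addr, unsigned Size) {
  // Slow path: executed at most once per bug, so its size mainly
  // costs icache footprint, not hot-path cycles.
  std::fprintf(stderr, "bad access of %u bytes at %p\n",
               Size, reinterpret_cast<void *>(Addr));
  std::abort();
}

inline void CheckAccess(uintptr_t Addr, unsigned Size) {
  int8_t Shadow = *reinterpret_cast<int8_t *>(
      (Addr >> kShadowScale) + kShadowOffset);
  if (Shadow != 0) {
    // Partial-granule check for accesses smaller than 8 bytes.
    if (static_cast<int8_t>((Addr & 7) + Size - 1) >= Shadow)
      ReportError(Addr, Size);
  }
}
```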
2020 Aug 05
10
[RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data
...the basic block sections feature recently introduced in LLVM from the Propeller project. The pass targets functions with profile coverage, identifies cold blocks and moves them to a separate section. The linker groups all cold blocks across functions together, decreasing fragmentation and improving icache and itlb utilization. Our experiments show >2% performance improvement on clang bootstrap, ~1% improvement on Google workloads and 1.6% mean performance improvement on SPEC IntRate 2017. Motivation Recent work at Google has shown that aggressive, profile-driven inlining for performance has led...
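An example of the kind of function the splitter targets (the feature later landed in clang behind a flag, -fsplit-machine-functions, if your toolchain has it). With profile data showing the error path is never or rarely reached, the pass moves that block into a separate cold section, which the linker groups with other functions' cold blocks.

```cpp
// Hot/cold shape the machine function splitter exploits.
#include <cstdio>
#include <cstdlib>

[[noreturn]] static void handle_error(int code) {
  // Cold: bulky error-formatting/cleanup code lives here.
  std::fprintf(stderr, "fatal error %d\n", code);
  std::abort();
}

int process(int input) {
  if (input < 0)           // profile says: almost never taken
    handle_error(input);   // -> candidate for the split cold section
  return input * 2;        // hot straight-line path stays compact
}
```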
2014 Oct 27
2
[PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
On 10/24/2014 06:04 PM, Peter Zijlstra wrote: > On Fri, Oct 24, 2014 at 04:53:27PM -0400, Waiman Long wrote: >> The additional register pressure may just cause a few more register moves >> which should be negligible in the overall performance . The additional >> icache pressure, however, may have some impact on performance. I was trying >> to balance the performance of the pv and non-pv versions so that we won't >> penalize the pv code too much for a bit more performance in the non-pv code. >> Doing it your way will add a lot of function ca...
2018 Nov 15
3
[cfe-dev] [RFC][ARM] -Oz implies -mthumb
I've never tried -mcpu=cortex-xyz but I know -march=armv7 defaults to Thumb OK, I just checked, and -mcpu=cortex-{m3,m4,m7,a7,a9,a15,a53} gives Thumb at -O1, -O1, -Os on the following gcc: arm-linux-gnueabihf-gcc (Ubuntu/Linaro 7.3.0-27ubuntu1~18.04) 7.3.0 cortex-m0 fails because it doesn't do hard float. I don't have an eabi compiler around. On Thu, Nov 15, 2018 at 4:14 AM, Tim
2012 Sep 11
3
[LLVMdev] Need Help Understanding Operands in X86 MachineFunctionPass
Dear All, I'm working on an X86 MachineFunctionPass that adds prefetch instructions to a function. I have code that adds a "prefetchnta <constant address>" instruction to x86 32-bit code. What I want to do is to add a "prefetchnta <constant address>" instruction to x86_64 code. The code for adding the 32-bit instruction is:
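The excerpt is cut off before the code itself. For illustration only, not the poster's actual code, a BuildMI call of the shape being described might look like the following, using x86's five-part memory operand (base, scale, index, displacement, segment); exact header paths vary across LLVM versions.

```cpp
// Sketch: emitting "prefetchnta <constant address>" in an X86
// MachineFunctionPass (inside lib/Target/X86, where X86::PREFETCHNTA
// from the generated instruction tables is visible).
#include "X86InstrInfo.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"

using namespace llvm;

static void addPrefetch(MachineBasicBlock &MBB,
                        MachineBasicBlock::iterator MI, DebugLoc DL,
                        const TargetInstrInfo *TII, int64_t Addr) {
  // No base or index register: the constant address goes in the
  // displacement field of the memory operand.
  BuildMI(MBB, MI, DL, TII->get(X86::PREFETCHNTA))
      .addReg(0)     // base register (none)
      .addImm(1)     // scale
      .addReg(0)     // index register (none)
      .addImm(Addr)  // displacement = constant address
      .addReg(0);    // segment register (none)
}
```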