search for: itlb

Displaying 20 results from an estimated 143 matches for "itlb".

2016 Mar 10
2
[RFC] Target-specific parametrization of function inliner
...l/return pairs to help branch prediction of ret instructions -- such stack has a target specific limit which can be triggered when a callsite is deep in the callchain. Register file size and register pressure increase due to inline comes as another example. Another relevant example is the icache/itlb sizes. To do a more precise analysis of the cost to 'speed' due to icache/itlb pressure increase requires target information, profile information as well as some global analysis. Easwaran has done some research in this area in the past and can share the analysis design when other things are...
2017 Jul 31
1
[RFC] Profile guided section layout
...ain Clustering (C³) heuristic from https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf . In the programs I've tested it on I've gotten from 0% to 5% performance improvement over standard PGO with zero cases of slowdowns and up to 15% reduction in ITLB misses. There are three parts to this implementation. The first is a new llvm pass which uses branch frequency info to get counts for each call instruction and then adds a module flags metadata table of function -> function edges along with their counts. ...
2020 Aug 05
10
[RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data
...lock sections feature recently introduced in LLVM from the Propeller project. The pass targets functions with profile coverage, identifies cold blocks and moves them to a separate section. The linker groups all cold blocks across functions together, decreasing fragmentation and improving icache and itlb utilization. Our experiments show >2% performance improvement on clang bootstrap, ~1% improvement on Google workloads and 1.6% mean performance improvement on SPEC IntRate 2017. Motivation Recent work at Google has shown that aggressive, profile-driven inlining for performance has led to signif...
2017 Jun 12
3
[RFC] Pagerando: Page-granularity code randomization
I could understand a TLB hit if functions that originally happened to be on the same page were spread across many pages, raising the iTLB footprint for a given loop, etc. (reduced spatial locality). For pagerando, since we're splitting on 4k page boundaries and can keep spatial locality (or attempt to improve it), I'm not sure that TLB misses will be a large factor. I expect that the runtime overhead of inter-page indirection...
2014 Nov 19
5
[LLVMdev] Odd code layout requirements for MCJIT
...it's needed for execution, so the physical layout of the generated code looks like the execution path taken the first time the code is run (we're actually a little smarter than that now, but this description is good enough for the problem at hand). We're under pretty severe icache/iTLB pressure, so we do whatever we can to keep the hot path as compact as possible. One of the ways we do this is by dividing our code cache into three fixed-size areas: main, cold, and frozen. Our current, non-llvm codegen backend has one area tag per basic block, and most tracelets we compile wil...
2016 Feb 27
2
Possible soundness issue with available_externally (split from "RFC: Add guard intrinsics")
...gnal methods to reduce the overall cost of the former. It is, and I agree. However, the code-size growth seems like it could easily be unacceptable even for speed-optimized builds without the kind of mitigation options mentioned. Except that there might be icache/itlb impact so performance is not totally immune to size increase -- otherwise we may as well just force inlining those :) David -Hal David -Chandler -- Hal Finkel Assistant Computational Scientist Leade...
2020 Aug 10
2
[RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data
...lock sections feature recently introduced in LLVM from the Propeller project. The pass targets functions with profile coverage, identifies cold blocks and moves them to a separate section. The linker groups all cold blocks across functions together, decreasing fragmentation and improving icache and itlb utilization. Our experiments show >2% performance improvement on clang bootstrap, ~1% improvement on Google workloads and 1.6% mean performance improvement on SPEC IntRate 2017. Motivation Recent work at Google has shown that aggressive, profile-driven inlining for performance has led to signi...
2016 Mar 10
3
[RFC] Target-specific parametrization of function inliner
IMO, the appropriate thing for TTI to inform the inliner about is how costly the actual act of a "call" is likely to be. I would hope that this would only be used on targets where there is some really dramatic overhead of actually doing a function call such that the code size cost incurred by inlining is completely dwarfed by the improvements. GPUs are one of the few platforms that
2020 Aug 05
3
[RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data
...duced in LLVM from the Propeller project. The pass targets functions with profile coverage, identifies cold blocks and moves them to a separate section. The linker groups all cold blocks across functions together, decreasing fragmentation and improving icache and itlb utilization. Our experiments show >2% performance improvement on clang bootstrap, ~1% improvement on Google workloads and 1.6% mean performance improvement on SPEC IntRate 2017. Motivation Recent work at Google has shown that aggressive, profile-drive...
2017 Jun 15
7
[RFC] Profile guided section layout
...ld using the Call-Chain Clustering (C³) heuristic from https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf . In the programs I've tested it on I've gotten from 0% to 5% performance improvement over standard PGO with zero cases of slowdowns and up to 15% reduction in ITLB misses. There are three parts to this implementation. The first is a new llvm pass which uses branch frequency info to get counts for each call instruction and then adds a module flags metadata table of function -> function edges along with their counts. The second takes the module flags me...
2016 Mar 02
3
[RFC] Target-specific parametrization of function inliner
Hi, I propose to make function inliner parameters adjustable for specific target. Currently function inlining pass appears to be target-agnostic with various constants for calculating call cost hardcoded. While it works reasonably well for general purpose CPUs, some quirkier targets like NVPTX would benefit from target-specific tuning. Currently it appears that there are two things that need to
2016 Feb 27
0
Possible soundness issue with available_externally (split from "RFC: Add guard intrinsics")
...l cost of the former. It is, and I agree. However, the code-size growth seems like it could easily be unacceptable even for speed-optimized builds without the kind of mitigation options mentioned. Except that there might be icache/itlb impact so performance is not totally immune to size increase -- otherwise we may as well just force inlining those :) This is why I am hoping we can aggressively apply that kind of privatization. However, it might turn out not to be the case. -Hal David -Hal ...
2018 Aug 02
2
Vectorizing remainder loop
Hi Hameeza, Aside from Ashutosh's patch..... When the vector width is that large, we can't keep vectorizing remainder like below. It'll be a huge code size if nothing else ---- hitting ITLB miss because of this is very bad, for example. VF=2048 // main vector loop VF=1024 // vectorized remainder 1 VF=512 // vectorized remainder 2 ... Vectorize remainder until trip count is small enough for scalar execution. Direction #1 Does your HW support efficient masking? If so, the first...
2017 Jun 12
2
[RFC] Pagerando: Page-granularity code randomization
On Mon, Jun 12, 2017 at 1:03 PM, Stephen Crane <sjc at immunant.com> wrote: I don't have performance measurements for the new LTO version of pagerando yet. I'll definitely be thoroughly measuring performance once the current prototype is finished before moving forward, and will post results when I have them. I'm definitely curious about your work
2016 Apr 01
2
[RFC] Target-specific parametrization of function inliner
...reshold (adjusted from a base value), then the callsite is considered an inline candidate. In most cases, the decision is made locally due to the bottom-up order (there are tweaks to bypass it). The size/cost can be remotely tied and serves a proxy to represent the real runtime cost due to icache/itlb effect, but it seems the size/threshold scheme is mainly used to model the runtime speedup vs compile time/binary size tradeoffs. Other than the call cost itself, I've been surprised that the TTI is not more involved when it comes to this tradeoff: instructions don't have the same tradeo...
2017 Apr 21
1
Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.
...d news: It isn't any better than 4.9.13 was for me either, if I don't set vcpu limit in the grub/xen config, it still panics like so: [ 6.716016] CPU: Physical Processor ID: 0 [ 6.720199] CPU: Processor Core ID: 0 [ 6.724046] mce: CPU supports 2 MCE banks [ 6.728239] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8 [ 6.733884] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0 [ 6.740770] Freeing SMP alternatives memory: 32K (ffffffff821a8000 - ffffffff821b0000) [ 6.750638] ftrace: allocating 34344 entries in 135 pages [ 6.771888] smpboot: Max logical packages:...
2016 Feb 27
1
Possible soundness issue with available_externally (split from "RFC: Add guard intrinsics")
...It is, and I agree. However, the code-size growth seems like it could easily be unacceptable even for speed-optimized builds without the kind of mitigation options mentioned. Except that there might be icache/itlb impact so performance is not totally immune to size increase -- otherwise we may as well just force inlining those :) This is why I am hoping we can aggressively apply that kind of privatization. However, it might turn out not to be the case. But to be clear, my...
2017 Apr 19
2
Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.
On 04/19/2017 12:18 PM, PJ Welsh wrote: On Wed, Apr 19, 2017 at 5:40 AM, Johnny Hughes <johnny at centos.org> wrote: On 04/18/2017 12:39 PM, PJ Welsh wrote: Here is something interesting... I went through the BIOS options and found that one R710 that *is* functioning only differed in that
2018 Aug 03
2
Vectorizing remainder loop
...AM, Saito, Hideki <hideki.saito at intel.com> wrote: Hi Hameeza, Aside from Ashutosh's patch..... When the vector width is that large, we can't keep vectorizing remainder like below. It'll be a huge code size if nothing else ---- hitting ITLB miss because of this is very bad, for example. VF=2048 // main vector loop VF=1024 // vectorized remainder 1 VF=512 // vectorized remainder 2 ... Vectorize remainder until trip count is small enough for scalar execution. Direction #1 Does your HW support e...
2017 Apr 21
0
Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.
...er than 4.9.13 was for me either, if I don't set vcpu limit in the grub/xen config, it still panics like so: [ 6.716016] CPU: Physical Processor ID: 0 [ 6.720199] CPU: Processor Core ID: 0 [ 6.724046] mce: CPU supports 2 MCE banks [ 6.728239] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8 [ 6.733884] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0 [ 6.740770] Freeing SMP alternatives memory: 32K (ffffffff821a8000 - ffffffff821b0000) [ 6.750638] ftrace: allocating 34344 entries in 135 pages [ 6.771888] smpboo...