Michael Kruse via llvm-dev
2018-Nov-01 21:36 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Thu, Nov 1, 2018 at 15:21 David Greene <dag at cray.com> wrote:
>
> > thank you for sharing the system hierarchy model. IMHO it makes a lot
> > of sense, although I don't know which of today's passes would make use
> > of it. Here are my remarks.
>
> LoopDataPrefetch would use it via the existing TTI interfaces, but I
> think that's about it for now. It's a bit of a chicken-and-egg, in that
> passes won't use it if it's not there and there's no push to get it in
> because few things use it. :)

What kinds of passes use it in the Cray compiler?

> > I am wondering how one could model the following features using this
> > model, or whether they should be part of a performance model at all:
> >
> > * ARM's big.LITTLE
>
> How is this modeled in the current AArch64 .td files? The current
> design doesn't capture heterogeneity at all, not because we're not
> interested but simply because our compiler captures that at a higher
> level outside of LLVM.

AFAIK it is not handled at all. Any architecture that supports
big.LITTLE will return 0 on getCacheLineSize(). See
AArch64Subtarget::initializeProperties().

> > * write-back / write-through write buffers
>
> Do you mean for caches, or something else?

https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies

Basically, with write-through, every store is a non-temporal store (or
every temporal store behaves like a write-through, depending on how you
view it).

> >> class TargetSoftwarePrefetcherInfo {
> >>   /// Should we do software prefetching at all?
> >>   ///
> >>   bool isEnabled() const;
> >
> > isEnabled sounds like something configurable at runtime.
>
> Currently we use it to allow some subtargets to do software prefetching
> and prevent it for others. I see how the name could be confusing
> though. Maybe ShouldDoPrefetching?

isPrefetchingProfitable()?

If it is a hardware property: isSupported()
(i.e. the prefetch instruction would be a no-op on other implementations)

> > Is there a way to express on which level the number of streams is
> > shared? For instance, a core might be able to track 16 streams, but
> > if 4 threads are running (SMT), each can only use 4.
>
> I suppose we could couple the streaming information to an execution
> resource, similar to what is done with cache levels to express this kind
> of sharing. We haven't found a need for it but that doesn't mean it
> wouldn't be useful for other/new targets.

The example above is IBM's Blue Gene/Q processor, so yes, such targets
do exist.

> > PowerPC's dcbt/dcbtst instruction allows explicitly specifying to the
> > hardware which streams it should establish. Do the buffer counts
> > include explicitly and automatically established streams? Do
> > non-stream accesses (e.g. stack accesses) count towards the limit?
>
> It's up to the target maintainer to decide what the numbers mean.
> Obviously passes have to have some notion of what things mean. The
> thing that establishes what a "stream" is in the user program lives
> outside of the system model. It may or may not consider random stack
> accesses as part of a stream.
>
> This is definitely an area for exploration. Since we only have machines
> with two major targets, we didn't need to contend with more exotic
> things. :)

IMHO it would be good if passes and targets agree on an interpretation
of this number when designing the interface.

Again, from the Blue Gene/Q: What counts as a stream is configurable at
runtime via a hardware register. It supports 3 settings:
* Interpret every memory access as the start of a stream
* Establish a stream when there are 2 consecutive cache misses
* Only establish streams via dcbt instructions.

> >> class TargetMemorySystemInfo {
> >>   const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
> >>
> >>   /// getNumLevels - Return the number of cache levels this target has.
> >>   ///
> >>   unsigned getNumLevels() const;
> >>
> >>   /// Cache level iterators
> >>   ///
> >>   cachelevel_iterator cachelevel_begin() const;
> >>   cachelevel_iterator cachelevel_end() const;
> >
> > May users of this class assume that a level refers to a specific
> > cache? E.g. getCacheLevel(0) being the L1 cache. Or do they have to
> > search for a cache of a specific size?
>
> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> the L2 cache and so on.

Can passes rely on it?

> >> //===--------------------------------------------------------------------===//
> >> // Stream Buffer Information
> >> //
> >> const TargetStreamBufferInfo *getStreamBufferInfo() const;
> >>
> >> //===--------------------------------------------------------------------===//
> >> // Software Prefetcher Information
> >> //
> >> const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
> >
> > Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
> > level? Some ISAs have multiple prefetchers/prefetch instructions
> > for different levels.
>
> Probably. Most X86 implementations direct all data prefetches to the
> same cache level so we didn't find a need to model this, but it makes
> sense to allow for it.

Again the Blue Gene/Q: streams prefetch into the L1P cache (P for
prefetch), but a dcbt instruction is necessary to establish the cache
line in the L1 cache.

> >> An open question is how to handle different SKUs within a subtarget
> >> family. We modeled the limited number of SKUs used in our products
> >> via multiple subtargets, so this wasn't a heavy burden for us, but a
> >> more robust implementation might allow for multiple ``MemorySystem``
> >> and/or ``ExecutionEngine`` models for a given subtarget. It's not yet
> >> clear whether that's a good/necessary thing and if it is, how to
> >> specify it with a compiler switch. ``-mcpu=shy-enigma
> >> -some-switch-to-specify-memory-and-execution-models``? It may very
> >> well be sufficient to have a general system model that applies
> >> relatively well over multiple SKUs.
> >
> > Adding more specific subtargets with more refined execution models
> > seems fine to me.
> > But is it reasonable to manage a database of all processors ever
> > produced in the compiler?
>
> No it is not. :) That's why this is an open question. We've found it
> perfectly adequate to define a single model for each major processor
> generation, but as I said we use a limited number of SKUs. We will
> need input from the community on this.

Independently of whether subtargets for SKUs are added, could we (also)
define these parameters via the command line, like xlc's -qcache option?

Michael
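P.S. To make the -qcache idea concrete, here is a rough sketch of what
such an override could look like on top of the proposed classes. The
option name and the getSizeInBytes() accessor are my invention for
illustration, not part of the RFC:

    // Rough sketch, not part of the RFC: allow overriding the modeled
    // cache parameters from the command line, analogous to xlc's -qcache.
    #include "llvm/Support/CommandLine.h"

    static llvm::cl::opt<unsigned> OverrideL1Size(
        "cache-l1-size", llvm::cl::init(0),
        llvm::cl::desc("Override the modeled L1 data cache size in bytes "
                       "(0 = use the subtarget's model)"));

    unsigned getEffectiveL1Size(const TargetMemorySystemInfo &MSI) {
      if (OverrideL1Size != 0)
        return OverrideL1Size;
      return MSI.getCacheLevel(0).getSizeInBytes(); // invented accessor
    }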
David Greene via llvm-dev
2018-Nov-01 21:55 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Michael Kruse via llvm-dev <llvm-dev at lists.llvm.org> writes:

> On Thu, Nov 1, 2018 at 15:21 David Greene <dag at cray.com> wrote:
>>
>> > thank you for sharing the system hierarchy model. IMHO it makes a lot
>> > of sense, although I don't know which of today's passes would make use
>> > of it. Here are my remarks.
>>
>> LoopDataPrefetch would use it via the existing TTI interfaces, but I
>> think that's about it for now. It's a bit of a chicken-and-egg, in that
>> passes won't use it if it's not there and there's no push to get it in
>> because few things use it. :)
>
> What kinds of passes use it in the Cray compiler?

Not sure how much I can say about that, unfortunately.

>> > I am wondering how one could model the following features using this
>> > model, or whether they should be part of a performance model at all:
>> >
>> > * ARM's big.LITTLE
>>
>> How is this modeled in the current AArch64 .td files? The current
>> design doesn't capture heterogeneity at all, not because we're not
>> interested but simply because our compiler captures that at a higher
>> level outside of LLVM.
>
> AFAIK it is not handled at all. Any architecture that supports
> big.LITTLE will return 0 on getCacheLineSize(). See
> AArch64Subtarget::initializeProperties().

Ok. I would like to start posting patches for review without
speculating too much on fancy/exotic things that may come later. We
shouldn't do anything that precludes extensions, but I don't want to get
bogged down in a lot of details on things related to a small number of
targets. Let's get the really common stuff in first. What do you think?

>> > * write-back / write-through write buffers
>>
>> Do you mean for caches, or something else?
>
> https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies
>
> Basically, with write-through, every store is a non-temporal store (or
> every temporal store behaves like a write-through, depending on how you
> view it).

A write-through store isn't the same thing as a non-temporal store, at
least in my understanding of the term from X86 and AArch64. A
non-temporal store bypasses the cache entirely. I'm struggling a bit to
understand how a compiler would make use of the cache's write-back
policy.

>> >> class TargetSoftwarePrefetcherInfo {
>> >>   /// Should we do software prefetching at all?
>> >>   ///
>> >>   bool isEnabled() const;
>> >
>> > isEnabled sounds like something configurable at runtime.
>>
>> Currently we use it to allow some subtargets to do software prefetching
>> and prevent it for others. I see how the name could be confusing
>> though. Maybe ShouldDoPrefetching?
>
> isPrefetchingProfitable()?

Sounds good.

> If it is a hardware property: isSupported()
> (i.e. the prefetch instruction would be a no-op on other implementations)

Oh, I hadn't even thought of that possibility.

>> > Is there a way to express on which level the number of streams is
>> > shared? For instance, a core might be able to track 16 streams, but
>> > if 4 threads are running (SMT), each can only use 4.
>>
>> I suppose we could couple the streaming information to an execution
>> resource, similar to what is done with cache levels to express this kind
>> of sharing. We haven't found a need for it but that doesn't mean it
>> wouldn't be useful for other/new targets.
>
> The example above is IBM's Blue Gene/Q processor, so yes, such targets
> do exist.

Ok.

>> > PowerPC's dcbt/dcbtst instruction allows explicitly specifying to the
>> > hardware which streams it should establish. Do the buffer counts
>> > include explicitly and automatically established streams? Do
>> > non-stream accesses (e.g. stack accesses) count towards the limit?
>>
>> It's up to the target maintainer to decide what the numbers mean.
>> Obviously passes have to have some notion of what things mean. The
>> thing that establishes what a "stream" is in the user program lives
>> outside of the system model. It may or may not consider random stack
>> accesses as part of a stream.
>>
>> This is definitely an area for exploration. Since we only have machines
>> with two major targets, we didn't need to contend with more exotic
>> things. :)
>
> IMHO it would be good if passes and targets agree on an interpretation
> of this number when designing the interface.

Of course.

> Again, from the Blue Gene/Q: What counts as a stream is configurable at
> runtime via a hardware register. It supports 3 settings:
> * Interpret every memory access as the start of a stream
> * Establish a stream when there are 2 consecutive cache misses
> * Only establish streams via dcbt instructions.

I think we're interpreting "streaming" differently. In this design, a
"stream" is a sequence of memory operations that should bypass the cache
because the data will never be reused (at least not in a timely manner).
On X86 processors the compiler has complete software control over
streaming through the use of movnt instructions. AArch64 had a similar,
though very restricted, capability until SVE.

dcbt is more like a prefetch than a movnt, right? It sounds like BG/Q
has a hardware prefetcher configurable by software. I think that would
fit better under a completely different resource type. The compiler's
use of dcbt would be guided by TargetSoftwarePrefetcherInfo, which could
be extended to represent BG/Q's configurable hardware prefetcher.

>> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
>> the L2 cache and so on.
>
> Can passes rely on it?

Yes.

>> Probably. Most X86 implementations direct all data prefetches to the
>> same cache level so we didn't find a need to model this, but it makes
>> sense to allow for it.
>
> Again the Blue Gene/Q: streams prefetch into the L1P cache (P for
> prefetch), but a dcbt instruction is necessary to establish the cache
> line in the L1 cache.

Yep, makes sense.

>> > Adding more specific subtargets with more refined execution models
>> > seems fine to me. But is it reasonable to manage a database of all
>> > processors ever produced in the compiler?
>>
>> No it is not. :) That's why this is an open question. We've found it
>> perfectly adequate to define a single model for each major processor
>> generation, but as I said we use a limited number of SKUs. We will
>> need input from the community on this.
>
> Independently of whether subtargets for SKUs are added, could we (also)
> define these parameters via the command line, like xlc's -qcache option?

I think that would be very useful.

-David
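P.S. For illustration, here is roughly how a pass could lean on that
ordering guarantee. getNumLevels() and getCacheLevel() are from the
proposal; the getSizeInBytes() accessor is made up for the example:

    // Sketch: find the smallest cache level that holds a given working
    // set, relying on getCacheLevel(0) == L1, getCacheLevel(1) == L2, ...
    #include <cstdint>

    unsigned findSmallestLevelHolding(const TargetMemorySystemInfo &MSI,
                                      uint64_t WorkingSetBytes) {
      for (unsigned L = 0, E = MSI.getNumLevels(); L != E; ++L)
        if (MSI.getCacheLevel(L).getSizeInBytes() >= WorkingSetBytes)
          return L; // 0 = L1, 1 = L2, ...
      return ~0u;   // does not fit in any modeled cache level
    }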
Renato Golin via llvm-dev
2018-Nov-02 20:06 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Hey,

I've been reading back the thread and there are a lot of ideas flying
around. I may have missed more than I should, but here's my view on it.

First, I think this is a good idea. Mapping caches is certainly
interesting for general architectures, but particularly important for
massive operations like matrix multiply and stencils, which can pull a
lot of data into cache and sometimes thrash it if not careful. With
scalable and larger vectors, this will be even more important.

Overall, I think this is a good idea, but the current proposal is too
detailed on the implementation and not detailed enough on the use for me
to have a good idea of how and where this will be used. Can you describe
a few situations where these new interfaces would be used, and how?

Some comments inline.

On Thu, 1 Nov 2018 at 21:56, David Greene via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Ok. I would like to start posting patches for review without
> speculating too much on fancy/exotic things that may come later. We
> shouldn't do anything that precludes extensions but I don't want to get
> bogged down in a lot of details on things related to a small number of
> targets. Let's get the really common stuff in first. What do you
> think?

In theory, both big and little cores should have the same cache
structure, so we don't necessarily need extra descriptions for both. In
practice, sub-architectures can have multiple combinations of big.LITTLE
cores and it's simply not practical to add all of that to TableGen.

--
cheers,
--renato
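P.S. An example of the kind of use I'm asking about: a blocked matrix
multiply could size its tiles from the modeled cache. A rough sketch,
assuming the model can be asked for the L1 size in bytes:

    // Sketch: pick a square tile size T so that three T x T tiles of
    // doubles (blocks of A, B and C) fit in the modeled L1 data cache,
    // i.e. 3 * T * T * sizeof(double) <= L1SizeBytes.
    #include <cmath>
    #include <cstdint>

    unsigned pickTileSize(uint64_t L1SizeBytes) {
      uint64_t MaxElems = L1SizeBytes / (3 * sizeof(double));
      unsigned T = static_cast<unsigned>(
          std::sqrt(static_cast<double>(MaxElems)));
      return T ? T : 1; // degenerate caches still get a 1x1 tile
    }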
Michael Kruse via llvm-dev
2018-Nov-02 21:16 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Thu, Nov 1, 2018 at 16:56 David Greene <dag at cray.com> wrote:
> Ok. I would like to start posting patches for review without
> speculating too much on fancy/exotic things that may come later. We
> shouldn't do anything that precludes extensions but I don't want to get
> bogged down in a lot of details on things related to a small number of
> targets. Let's get the really common stuff in first. What do you
> think?

I agree.

> > Again, from the Blue Gene/Q: What counts as a stream is configurable
> > at runtime via a hardware register. It supports 3 settings:
> > * Interpret every memory access as the start of a stream
> > * Establish a stream when there are 2 consecutive cache misses
> > * Only establish streams via dcbt instructions.
>
> I think we're interpreting "streaming" differently. In this design, a
> "stream" is a sequence of memory operations that should bypass the cache
> because the data will never be reused (at least not in a timely manner).

I understood "stream" as "prefetch stream", i.e. something that
prefetches the data for an access A[i] in a for-loop. I'd call
"bypassing the cache because the data will never be reused" a
non-temporal memory access.

Under the latter interpretation, what does "number of streams" mean?
AFAIU the processor will just queue memory operations (e.g. for writing
to RAM). Is it the maximum number of operations in the queue?

> >> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> >> the L2 cache and so on.
> >
> > Can passes rely on it?
>
> Yes.

Naively, I'd put Blue Gene/Q's L1P cache between the L1 and the L2,
i.e. the L1P would be getCacheLevel(1) and getCacheLevel(2) would be
the L2. How would you model it instead?

Michael
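P.S. For concreteness, the naive encoding I have in mind would look
something like this. The struct is purely illustrative (the RFC's
TargetCacheLevelInfo would play this role), and the sizes are the Blue
Gene/Q numbers as I remember them:

    // Purely illustrative: Blue Gene/Q's hierarchy with the L1P prefetch
    // buffer as its own numbered level between L1 and L2. Whether a
    // prefetch-only buffer should occupy a level is exactly the question.
    #include <cstdint>

    struct CacheLevelDesc {
      const char *Name;
      uint64_t SizeInBytes;
      bool PrefetchOnly; // lines are staged here rather than demand-cached
    };

    static const CacheLevelDesc BGQLevels[] = {
        {"L1", 16 * 1024, false},        // getCacheLevel(0): per-core L1D
        {"L1P", 4 * 1024, true},         // getCacheLevel(1): prefetch buffer
        {"L2", 32 * 1024 * 1024, false}, // getCacheLevel(2): shared L2
    };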