Michael Kruse via llvm-dev
2018-Nov-02 21:16 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Thu, Nov 1, 2018 at 16:56, David Greene <dag at cray.com> wrote:
> Ok.  I would like to start posting patches for review without
> speculating too much on fancy/exotic things that may come later.  We
> shouldn't do anything that precludes extensions but I don't want to get
> bogged down in a lot of details on things related to a small number of
> targets.  Let's get the really common stuff in first.  What do you
> think?

I agree.

> > Again, from the Blue Gene/Q: What counts as stream is configurable at
> > runtime via a hardware register. It supports 3 settings:
> > * Interpret every memory access as start of a stream
> > * Interpret a stream when there are 2 consecutive cache misses
> > * Only establish streams via dcbt instructions.
>
> I think we're interpreting "streaming" differently.  In this design, a
> "stream" is a sequence of memory operations that should bypass the cache
> because the data will never be reused (at least not in a timely manner).

I understood "stream" as "prefetch stream", something that prefetches
the data for an access A[i] in a for-loop.

I'd call "bypassing the cache because the data will never be reused" a
non-temporal memory access.

In the latter interpretation, what does "number of streams" mean? AFAIU
the processor will just queue memory operations (e.g. for writing to
RAM). Is it the maximum number of operations in the queue?

> >> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> >> the L2 cache and so on.
> >
> > Can passes rely on it?
>
> Yes.

Naively, I'd put Blue Gene/Q's L1P cache between the L1 and the L2,
i.e. the L1P would be getCacheLevel(1) and getCacheLevel(2) would be
L2. How would you model it instead?

Michael
David Greene via llvm-dev
2018-Nov-05 16:25 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Michael Kruse <llvmdev at meinersbur.de> writes:

>> > Again, from the Blue Gene/Q: What counts as stream is configurable at
>> > runtime via a hardware register. It supports 3 settings:
>> > * Interpret every memory access as start of a stream
>> > * Interpret a stream when there are 2 consecutive cache misses
>> > * Only establish streams via dcbt instructions.
>>
>> I think we're interpreting "streaming" differently.  In this design, a
>> "stream" is a sequence of memory operations that should bypass the cache
>> because the data will never be reused (at least not in a timely manner).
>
> I understood "stream" as "prefetch stream", something that prefetches
> the data for an access A[i] in a for-loop.
>
> I'd call "bypassing the cache because the data will never be reused" a
> non-temporal memory access.

Yes, I agree the terminology is confusing.  I used the term "stream" in
the sense of stream processing
(https://en.wikipedia.org/wiki/Stream_processing).  The programming
model is very different, of course, but the idea of a stream of data
that is acted upon and then essentially discarded is similar.

> In the latter interpretation, what does "number of streams" mean?
> AFAIU the processor will just queue memory operations (e.g. for
> writing to RAM). Is it the maximum number of operations in the queue?

On X86, NT writes go to so-called "write-combined memory."
Hardware-wise, that translates to some number of buffers where stores
are collected and merged, resulting in a (hopefully) single cache
line-sized write to memory that aggregates some number of individual
stores.  The number of buffers limits the number of independent sets of
stores that can be active.  For example, if the hardware has four such
buffers, I can do this and be fine:

for (...)
  A[i] = ...
  B[i] = ...
  C[i] = ...
  D[i] = ...

The sequence of (non-temporal) stores to each array will map to separate
hardware buffers and be write-combined in the way one would expect.  But
this causes a problem:

for (...)
  A[i] = ...
  B[i] = ...
  C[i] = ...
  D[i] = ...
  E[i] = ...

If all of the stores are non-temporal, then at least two of the array
store sequences will interfere with each other in the write-combining
buffers and will force early flushes of the buffer, effectively turning
them into single-store writes to memory.  That's bad.

Maybe the proper name for this concept is simply "WriteCombiningBuffer."
I'm not sure whether some other architecture might have a concept of store
buffers that does something other than write-combining, so I was trying
to use a fairly generic name to mean "some compiler-controlled hardware
buffer."

There's a similar concept on the load side, though I don't know whether
any existing processors actually implement things that way.  I know of
(academic) architectures where prefetches fill independent prefetch
buffers and one wouldn't want to prefetch too many different things
because they would start filling each other's buffers.  That kind of
behavior could be captured by this model.

The key factor is contention.  There's a limited hardware memory buffer
resource and compilers have to be careful not to oversubscribe it.  I
don't know what the right name for it is.  Probably there will be more
than one such type of resource on some architectures, so for now maybe
we just model write-combining buffers and leave it at that.  If other
such resources pop up, we can model them with different names.

I think that's what we should do.
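To make the contention point concrete, the kind of code I'm talking
about looks roughly like this (just a sketch; _mm_stream_si32 is the
SSE2 non-temporal store intrinsic, and the four-buffer count is the
hypothetical one from above):

#include <immintrin.h>   // _mm_stream_si32, _mm_sfence

// Four independent non-temporal store streams.  With four
// write-combining buffers, each array gets its own buffer and the
// stores are merged into full cache-line writes to memory.
void fill4(int *A, int *B, int *C, int *D, int n) {
  for (int i = 0; i < n; ++i) {
    _mm_stream_si32(&A[i], i);   // NT store: bypasses the cache and
    _mm_stream_si32(&B[i], i);   // collects in a WC buffer instead
    _mm_stream_si32(&C[i], i);
    _mm_stream_si32(&D[i], i);
  }
  _mm_sfence();                  // flush/order the combined writes
}

Add a fifth streamed array to the loop body and, with only four
buffers, two of the store sequences start evicting each other's
partially filled buffers, which is the early-flush problem described
above.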
>> >> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
>> >> the L2 cache and so on.
>> >
>> > Can passes rely on it?
>>
>> Yes.
>
> Naively, I'd put Blue Gene/Q's L1P cache between the L1 and the L2,
> i.e. the L1P would be getCacheLevel(1) and getCacheLevel(2) would be
> L2. How would you model it instead?

That seems ok to me.  As I understand it, L1P is a little awkward in
that L2 data doesn't get moved to L1P, it gets moved to L1.  L1P is
really a prefetch buffer, right?  One wouldn't do, say, cache blocking
for L1P.  In that sense maybe modeling it as a cache level isn't the
right thing.

How does software make use of L1P?  I understand compilers can insert
data prefetches and the data resides in L1P, presumably until it is
accessed and then it moves to L1.  I suppose the size of L1P could
determine how aggressively compilers prefetch.  Is that the idea or are
you thinking of something else?

                      -David
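P.S.  To make the cache-blocking remark concrete, this is the sort of
decision I expect passes to drive from the level-0 (L1) parameters.
The sketch below simply takes the cache size as an argument, since the
exact query names are still under review:

#include <cmath>   // std::sqrt

// Tile a transposed copy so each tile's working set (one tile of src
// plus one tile of dst) fits in the L1 data cache.  l1_bytes would
// come from the level-0 entry of the cache model.
void tiled_transpose(double *dst, const double *src, int n,
                     unsigned l1_bytes) {
  int tile = (int)std::sqrt((double)l1_bytes / (2 * sizeof(double)));
  if (tile < 1) tile = 1;
  for (int ii = 0; ii < n; ii += tile)
    for (int jj = 0; jj < n; jj += tile)
      for (int i = ii; i < ii + tile && i < n; ++i)
        for (int j = jj; j < jj + tile && j < n; ++j)
          dst[j * n + i] = src[i * n + j];
}

Blocking like this only makes sense for a level that actually holds
reused data, which is why a prefetch buffer like L1P doesn't fit the
same mold.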
Michael Kruse via llvm-dev
2018-Nov-07 23:26 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Mon, Nov 5, 2018 at 10:26, David Greene <dag at cray.com> wrote:
> Yes, I agree the terminology is confusing.  I used the term "stream" in
> the sense of stream processing
> (https://en.wikipedia.org/wiki/Stream_processing).  The programming
> model is very different, of course, but the idea of a stream of data
> that is acted upon and then essentially discarded is similar.
>
> > In the latter interpretation, what does "number of streams" mean?
> > AFAIU the processor will just queue memory operations (e.g. for
> > writing to RAM). Is it the maximum number of operations in the queue?
>
> On X86, NT writes go to so-called "write-combined memory."
> Hardware-wise, that translates to some number of buffers where stores
> are collected and merged, resulting in a (hopefully) single cache
> line-sized write to memory that aggregates some number of individual
> stores.  The number of buffers limits the number of independent sets of
> stores that can be active.  For example, if the hardware has four such
> buffers, I can do this and be fine:
>
> for (...)
>   A[i] = ...
>   B[i] = ...
>   C[i] = ...
>   D[i] = ...
>
> The sequence of (non-temporal) stores to each array will map to separate
> hardware buffers and be write-combined in the way one would expect.  But
> this causes a problem:
>
> for (...)
>   A[i] = ...
>   B[i] = ...
>   C[i] = ...
>   D[i] = ...
>   E[i] = ...
>
> If all of the stores are non-temporal, then at least two of the array
> store sequences will interfere with each other in the write-combining
> buffers and will force early flushes of the buffer, effectively turning
> them into single-store writes to memory.  That's bad.
>
> Maybe the proper name for this concept is simply "WriteCombiningBuffer."
> I'm not sure whether some other architecture might have a concept of store
> buffers that does something other than write-combining, so I was trying
> to use a fairly generic name to mean "some compiler-controlled hardware
> buffer."
>
> There's a similar concept on the load side, though I don't know whether
> any existing processors actually implement things that way.  I know of
> (academic) architectures where prefetches fill independent prefetch
> buffers and one wouldn't want to prefetch too many different things
> because they would start filling each other's buffers.  That kind of
> behavior could be captured by this model.
>
> The key factor is contention.  There's a limited hardware memory buffer
> resource and compilers have to be careful not to oversubscribe it.  I
> don't know what the right name for it is.  Probably there will be more
> than one such type of resource on some architectures, so for now maybe
> we just model write-combining buffers and leave it at that.  If other
> such resources pop up, we can model them with different names.
>
> I think that's what we should do.

Thank you for the detailed explanation.

We could use the notion of a "sustainable stream", i.e. the maximum
number of (consecutive?) read/write streams that a processor can
support before a disproportionate loss in performance happens. This is
oblivious to the reason why that performance loss happens, be it
write-combining buffers or prefetch streams. If there are multiple such
bottlenecks, it would be the minimum number of such streams. At the
moment I cannot think of an optimization where the difference matters
(which doesn't mean there isn't a case where it does).
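For illustration, what I have in mind is something like the sketch
below; the struct and function names are made up for this mail, not a
proposal for the actual interface:

#include <algorithm>   // std::min

// Hypothetical per-target limits; both map to the "contention" idea
// from your mail, just with different hardware causes.
struct StreamLimits {
  unsigned WriteCombiningBuffers;   // e.g. the 4 buffers in the X86 example
  unsigned HWPrefetchStreams;       // e.g. prefetch engines on Blue Gene/Q
};

// Number of concurrent read/write streams a loop can use before we
// expect a disproportionate slowdown: the tightest bottleneck wins.
unsigned sustainableStreams(const StreamLimits &L) {
  return std::min(L.WriteCombiningBuffers, L.HWPrefetchStreams);
}

A transformation such as loop distribution could then split a loop
whose number of simultaneously accessed arrays exceeds this value,
without having to know which of the two limits it would have hit.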
> That seems ok to me.  As I understand it, L1P is a little awkward in
> that L2 data doesn't get moved to L1P, it gets moved to L1.  L1P is
> really a prefetch buffer, right?  One wouldn't do, say, cache blocking
> for L1P.  In that sense maybe modeling it as a cache level isn't the
> right thing.

The L1P (4 KiB) is smaller than the L1 cache (16 KiB), so blocking for
it indeed makes no sense. But when optimizing for the Blue Gene/Q, I
could not just ignore it. However, maybe we should leave it out of our
API considerations: the Blue Gene/Q is being phased out, and I know of
no other architecture with such a cache hierarchy.

> How does software make use of L1P?  I understand compilers can insert
> data prefetches and the data resides in L1P, presumably until it is
> accessed and then it moves to L1.  I suppose the size of L1P could
> determine how aggressively compilers prefetch.  Is that the idea or are
> you thinking of something else?

I declared streams for the CPU to prefetch (which "run" over memory at
different speeds), so that at some point in time I can assume the data
to be in the L1P cache. Using the dcbt instruction, a cache line can be
lifted from the L1P into the L1 cache a fixed number of cycles in
advance. If the cache line had to be prefetched from the L2 instead,
the prefetch/access latency would be longer (24 cycles vs. 82 cycles).

Michael
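P.S.  As a rough illustration of the pattern I mean, using the generic
__builtin_prefetch (which lowers to dcbt on PowerPC) and a made-up
prefetch distance:

// Prefetch a fixed distance ahead of the use so the line has already
// been lifted into the L1 by the time it is accessed.  The distance of
// 8 iterations is only a placeholder; it would be derived from the
// prefetch latency (e.g. the 24 cycles mentioned above).
double sum_with_prefetch(const double *A, long n) {
  const long Dist = 8;
  double sum = 0.0;
  for (long i = 0; i < n; ++i) {
    if (i + Dist < n)
      __builtin_prefetch(&A[i + Dist], /*rw=*/0, /*locality=*/3);
    sum += A[i];
  }
  return sum;
}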