Michael Kruse via llvm-dev
2018-Nov-07 23:26 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Mon., Nov. 5, 2018 at 10:26 AM, David Greene <dag at cray.com> wrote:

> Yes, I agree the terminology is confusing. I used the term "stream" in
> the sense of stream processing
> (https://en.wikipedia.org/wiki/Stream_processing). The programming
> model is very different, of course, but the idea of a stream of data
> that is acted upon and then essentially discarded is similar.
>
>> In the latter interpretation, what does "number of streams" mean?
>> AFAIU the processor will just queue memory operations (e.g. for
>> writing to RAM). Is it the maximum number of operations in the queue?
>
> On X86, NT writes go to so-called "write-combined memory".
> Hardware-wise, that translates to some number of buffers where stores
> are collected and merged, resulting in a (hopefully) single cache
> line-sized write to memory that aggregates some number of individual
> stores. The number of buffers limits the number of independent sets
> of stores that can be active. For example, if the hardware has four
> such buffers, I can do this and be fine:
>
>   for (...)
>     A[i] = ...
>     B[i] = ...
>     C[i] = ...
>     D[i] = ...
>
> The sequence of (non-temporal) stores to each array will map to a
> separate hardware buffer and be write-combined in the way one would
> expect. But this causes a problem:
>
>   for (...)
>     A[i] = ...
>     B[i] = ...
>     C[i] = ...
>     D[i] = ...
>     E[i] = ...
>
> If all of the stores are non-temporal, then at least two of the array
> store sequences will interfere with each other in the write-combining
> buffers and will force early flushes of the buffers, effectively
> turning them into single-store writes to memory. That's bad.
>
> Maybe the proper name for this concept is simply
> "WriteCombiningBuffer". I'm not sure whether some other architecture
> might have a concept of store buffers that does something other than
> write-combining, so I was trying to use a fairly generic name to mean
> "some compiler-controlled hardware buffer".
>
> There's a similar concept on the load side, though I don't know if
> any existing processors actually implement things that way. I know of
> (academic) architectures where prefetches fill independent prefetch
> buffers, and one wouldn't want to prefetch too many different things
> because they would start filling each other's buffers. That kind of
> behavior could be captured by this model.
>
> The key factor is contention. There's a limited hardware memory
> buffer resource, and compilers have to be careful not to oversubscribe
> it. I don't know what the right name for it is. Probably there will
> be more than one such type of resource for some architectures, so for
> now maybe we just model write-combining buffers and leave it at that.
> If other such resources pop up, we can model them with different
> names.
>
> I think that's what we should do.

Thank you for the detailed explanation. We could use the notion of a
"sustainable stream": the maximum number of (consecutive?) read/write
streams that a processor can support before a disproportionate loss in
performance occurs. This is oblivious to the reason why that
performance loss happens, be it write-combining buffers or prefetch
streams. If there are multiple such bottlenecks, it would be the
minimum over them. At the moment I cannot think of an optimization
where the difference matters (which doesn't mean there isn't a case
where it does).
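A minimal sketch of how such a single "sustainable streams" number could
surface in a target description; the struct, field names, and values here
are invented for illustration, not taken from the RFC:

    #include <algorithm>

    // "Sustainable streams" as a single number: the minimum over all
    // stream-limiting resources, regardless of which one bites first.
    struct TargetStreamLimits {
      unsigned PrefetchStreams = 16;       // illustrative value only
      unsigned WriteCombiningBuffers = 4;  // illustrative value only

      unsigned sustainableStreams() const {
        return std::min(PrefetchStreams, WriteCombiningBuffers);
      }
    };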
> That seems ok to me. As I understand it, L1P is a little awkward in
> that L2 data doesn't get moved to L1P, it gets moved to L1. L1P is
> really a prefetch buffer, right? One wouldn't do, say, cache blocking
> for L1P. In that sense maybe modeling it as a cache level isn't the
> right thing.

The L1P (4 KiB) is smaller than the L1 cache (16 KiB), so blocking
indeed makes no sense. But when optimizing for it, I could not just
ignore it either. However, maybe we should leave it out of our API
considerations: the Blue Gene/Q is being phased out, and I know of no
other architecture with such a cache hierarchy.

> How does software make use of L1P? I understand compilers can insert
> data prefetches and the data resides in L1P, presumably until it is
> accessed and then it moves to L1. I suppose the size of L1P could
> determine how aggressively compilers prefetch. Is that the idea or
> are you thinking of something else?

I declared streams for the CPU to prefetch (which 'run' at different
speeds over the memory), which I can then assume, at some point in
time, to be in the L1P cache. Using the dcbt instruction, a cache line
can be lifted from L1P into the L1 cache a fixed number of cycles in
advance. If the cache line instead had to be fetched from L2, the
prefetch/access latency would be longer (24 cycles from L1P vs. 82
cycles from L2).

Michael
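To make the dcbt pattern concrete, a rough sketch of what Michael
describes; it only assembles for PowerPC targets, the prefetch distance
of 16 elements is an illustrative guess rather than a tuned BG/Q value,
and a real implementation would issue one dcbt per cache line, not per
element:

    // Touch a line ahead of the current access. On BG/Q, dcbt pulls the
    // line from L1P into L1 if the stream prefetcher already staged it
    // there (~24 cycles), instead of waiting on L2 (~82 cycles).
    // dcbt is only a hint, so touching past the end of x cannot fault.
    void scale(double *y, const double *x, double a, long n) {
      for (long i = 0; i < n; ++i) {
        __asm__ volatile("dcbt 0,%0" : : "r"(&x[i + 16]));
        y[i] = a * x[i];
      }
    }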
David Greene via llvm-dev
2018-Nov-08 16:35 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Michael Kruse <llvmdev at meinersbur.de> writes:

> Thank you for the detailed explanation. We could use the notion of a
> "sustainable stream": the maximum number of (consecutive?) read/write
> streams that a processor can support before a disproportionate loss
> in performance occurs. This is oblivious to the reason why that
> performance loss happens, be it write-combining buffers or prefetch
> streams. If there are multiple such bottlenecks, it would be the
> minimum over them. At the moment I cannot think of an optimization
> where the difference matters (which doesn't mean there isn't a case
> where it does).

What about load prefetching vs. non-temporal stores on X86? There's a
limited number of write-combining buffers, but prefetches "just" use
the regular load paths. Yes, there's a limited number of load buffers,
but I would expect the number of independent prefetch streams one
would want to differ substantially from the number of independent
non-temporal store streams one would want, and you wouldn't want the
minimum of one to apply to the other.

I like the idea of abstracting the hardware resource for the
compiler's needs, though I think we will in general want multiple such
things. Maybe one for loads and one for stores to start? For more
hardware-y things like llvm-mca, more detail may be desired.

>> That seems ok to me. As I understand it, L1P is a little awkward in
>> that L2 data doesn't get moved to L1P, it gets moved to L1. L1P is
>> really a prefetch buffer, right? One wouldn't do, say, cache
>> blocking for L1P. In that sense maybe modeling it as a cache level
>> isn't the right thing.
>
> The L1P (4 KiB) is smaller than the L1 cache (16 KiB), so blocking
> indeed makes no sense.
>
> But when optimizing for it, I could not just ignore it either.
> However, maybe we should leave it out of our API considerations: the
> Blue Gene/Q is being phased out, and I know of no other architecture
> with such a cache hierarchy.

Ok. See more below.

>> How does software make use of L1P? I understand compilers can insert
>> data prefetches and the data resides in L1P, presumably until it is
>> accessed and then it moves to L1. I suppose the size of L1P could
>> determine how aggressively compilers prefetch. Is that the idea or
>> are you thinking of something else?
>
> I declared streams for the CPU to prefetch (which 'run' at different
> speeds over the memory), which I can then assume, at some point in
> time, to be in the L1P cache. Using the dcbt instruction, a cache
> line can be lifted from L1P into the L1 cache a fixed number of
> cycles in advance. If the cache line instead had to be fetched from
> L2, the prefetch/access latency would be longer (24 cycles from L1P
> vs. 82 cycles from L2).

Ok, I understand better now, thanks. L1P really is a prefetch buffer,
but there's software control to move its contents into the faster
cache if desired. Should we model it as part of the prefetching API?

-David
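David's point above, that the store-side minimum should never constrain
loads, might look like the following split; the enum and names are
hypothetical, not from the RFC:

    // Hypothetical split of the abstract "stream" resource by access
    // kind, so a small write-combining-buffer count does not throttle
    // prefetching, and vice versa.
    enum class StreamKind { Load, Store };

    struct StreamResources {
      unsigned LoadStreams;  // e.g. bounded by prefetch machinery
      unsigned StoreStreams; // e.g. bounded by write-combining buffers

      unsigned max(StreamKind K) const {
        return K == StreamKind::Load ? LoadStreams : StoreStreams;
      }
    };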
Finkel, Hal J. via llvm-dev
2018-Nov-08 17:09 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On 11/08/2018 10:35 AM, David Greene via llvm-dev wrote:
> Michael Kruse <llvmdev at meinersbur.de> writes:
>
> [...]
>
>>> How does software make use of L1P? I understand compilers can
>>> insert data prefetches and the data resides in L1P, presumably
>>> until it is accessed and then it moves to L1. I suppose the size
>>> of L1P could determine how aggressively compilers prefetch. Is
>>> that the idea or are you thinking of something else?
>>
>> I declared streams for the CPU to prefetch (which 'run' at different
>> speeds over the memory), which I can then assume, at some point in
>> time, to be in the L1P cache. Using the dcbt instruction, a cache
>> line can be lifted from L1P into the L1 cache a fixed number of
>> cycles in advance. If the cache line instead had to be fetched from
>> L2, the prefetch/access latency would be longer (24 cycles from L1P
>> vs. 82 cycles from L2).
>
> Ok, I understand better now, thanks. L1P really is a prefetch buffer,
> but there's software control to move its contents into the faster
> cache if desired. Should we model it as part of the prefetching API?

At this point, I'd not base any API-structuring decisions on the BG/Q
specifically. The generic feature that might be worth modeling is:
into what level of cache does automated prefetching take place? I know
of several architectures that don't do automated prefetching into the
L1, but only into the L2 (or similar).
-Hal

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
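Hal's generic feature, namely which cache level automated prefetching
fills, could be recorded as a per-level attribute. This sketch is
hypothetical, and the sizes and two-level shape are examples only:

    // Each cache level records whether the hardware prefetcher can fill
    // it directly.
    struct CacheLevel {
      unsigned SizeKiB;
      bool AutoPrefetchTarget;
    };

    // An architecture where automated prefetching reaches only the L2,
    // as on the architectures Hal mentions:
    const CacheLevel Hierarchy[] = {
        {32, false},  // L1D: demand fills only
        {1024, true}, // L2: the hardware prefetcher streams into here
    };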
Michael Kruse via llvm-dev
2018-Nov-09 22:31 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Thu., Nov. 8, 2018 at 10:36 AM, David Greene <dag at cray.com> wrote:

> What about load prefetching vs. non-temporal stores on X86? There's a
> limited number of write-combining buffers, but prefetches "just" use
> the regular load paths. Yes, there's a limited number of load
> buffers, but I would expect the number of independent prefetch
> streams one would want to differ substantially from the number of
> independent non-temporal store streams one would want, and you
> wouldn't want the minimum of one to apply to the other.
>
> I like the idea of abstracting the hardware resource for the
> compiler's needs, though I think we will in general want multiple
> such things. Maybe one for loads and one for stores to start? For
> more hardware-y things like llvm-mca, more detail may be desired.

Your RFC already has getNumStoreBuffers, getNumLoadBuffers and
getNumLoadStoreBuffers, no? As far as I understand, write-combining
only applies to getNumStoreBuffers(); prefetch streams would limit
getNumLoadBuffers.

Michael
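Tying this back to David's five-array example: a pass could consult
getNumStoreBuffers(), one of the queries named in this thread, before
emitting non-temporal stores. Everything else in this sketch, including
the helper names and pass logic, is invented for illustration:

    struct TargetSystemModel {
      unsigned getNumStoreBuffers() const; // query named in the RFC thread
    };
    struct Loop;
    unsigned countNTStoreStreams(const Loop &);             // hypothetical
    void distributeLoop(Loop &, unsigned MaxStoreStreams);  // hypothetical

    // With four write-combining buffers, the earlier five-array loop
    // would be distributed into a four-stream loop plus a one-stream
    // loop rather than thrashing the buffers.
    void limitNTStoreStreams(Loop &L, const TargetSystemModel &M) {
      unsigned Buffers = M.getNumStoreBuffers();
      if (countNTStoreStreams(L) > Buffers)
        distributeLoop(L, Buffers);
    }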