Renato Golin via llvm-dev
2018-Nov-02 20:06 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Hey,

I've been reading back through the thread and there are a lot of ideas flying around; I may have missed more than I should, but here's my view on it.

First, I think this is a good idea. Mapping caches is certainly interesting for general architectures, but it is particularly important for massive operations like matrix multiply and stencils, which can pull a lot of data into cache and thrash it if they are not careful. With scalable and larger vectors, this will be even more important.

Overall, though, the current proposal is too detailed on the implementation and not detailed enough on the use for me to get a good idea of how and where this will be used. Can you describe a few situations where these new interfaces would be used, and how?

Some comments inline.

On Thu, 1 Nov 2018 at 21:56, David Greene via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Ok. I would like to start posting patches for review without
> speculating too much on fancy/exotic things that may come later. We
> shouldn't do anything that precludes extensions, but I don't want to
> get bogged down in a lot of details on things related to a small
> number of targets. Let's get the really common stuff in first. What
> do you think?

In theory, both big and little cores should have the same cache structure, so we don't necessarily need extra descriptions for both. In practice, sub-architectures can have multiple combinations of big.LITTLE cores, and it's simply not practical to add all of that to TableGen.

--
cheers,
--renato
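To make the thrashing concern concrete, here is a minimal cache-blocking sketch. It is purely illustrative: the 32 KiB L1 size and the tile size BS are assumptions baked in as constants, exactly the numbers a pass would instead obtain from the proposed cache model. The caller is assumed to have zero-initialized C.

#include <algorithm>
#include <cstddef>

constexpr std::size_t L1Bytes = 32 * 1024; // assumed L1D size; a pass would
                                           // query the target model instead
constexpr std::size_t BS = 32;             // tile edge, in doubles
static_assert(3 * BS * BS * sizeof(double) <= L1Bytes,
              "three tiles (one per operand) must fit in the assumed L1");

void matmulBlocked(const double *A, const double *B, double *C,
                   std::size_t N) {
  for (std::size_t ii = 0; ii < N; ii += BS)
    for (std::size_t kk = 0; kk < N; kk += BS)
      for (std::size_t jj = 0; jj < N; jj += BS)
        // Work on one BS x BS tile of each operand so the tiles stay
        // resident in cache instead of streaming whole rows and columns
        // through it.
        for (std::size_t i = ii, ie = std::min(ii + BS, N); i < ie; ++i)
          for (std::size_t k = kk, ke = std::min(kk + BS, N); k < ke; ++k)
            for (std::size_t j = jj, je = std::min(jj + BS, N); j < je; ++j)
              C[i * N + j] += A[i * N + k] * B[k * N + j];

The point of the tiling is that each tile is reused many times while resident in cache; with the model in place, BS could be derived per target rather than guessed.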
David Greene via llvm-dev
2018-Nov-05 15:56 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Renato Golin <renato.golin at linaro.org> writes:

> Mapping caches is certainly interesting for general architectures, but
> it is particularly important for massive operations like matrix
> multiply and stencils, which can pull a lot of data into cache and
> thrash it if they are not careful.

Exactly right.

> With scalable and larger vectors, this will be even more important.

True.

> Overall, though, the current proposal is too detailed on the
> implementation and not detailed enough on the use for me to get a good
> idea of how and where this will be used. Can you describe a few
> situations where these new interfaces would be used, and how?

Sure. The prefetching interfaces are already used, though in a different form, by the LoopDataPrefetch pass.

The cache interfaces are flexible enough to allow passes to answer questions like, "How much effective cache is available for this core (thread, etc.)?" That's a critical question when reasoning about the thrashing behavior you mentioned above.

Knowing the cache line size is important for prefetching and for various other memory operations, such as streaming.

Knowing the number of ways allows one to guesstimate which memory accesses are likely to collide in the cache.

It also happens that all of these parameters are useful for simulation purposes, which may help projects like llvm-mca.

> On Thu, 1 Nov 2018 at 21:56, David Greene via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Ok. I would like to start posting patches for review without
>> speculating too much on fancy/exotic things that may come later. We
>> shouldn't do anything that precludes extensions, but I don't want to
>> get bogged down in a lot of details on things related to a small
>> number of targets. Let's get the really common stuff in first. What
>> do you think?
>
> In theory, both big and little cores should have the same cache
> structure, so we don't necessarily need extra descriptions for both.
>
> In practice, sub-architectures can have multiple combinations of
> big.LITTLE cores, and it's simply not practical to add all of that to
> TableGen.

I'm not quite grasping this. Are you saying that a particular subtarget may have multiple "clusters" of big.LITTLE cores and that each cluster may look different from the others?

-David
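For illustration, here is a sketch of how a pass might consume these parameters. It is built on the getCacheSize/getCacheAssociativity/getCacheLineSize hooks that already exist on TargetTransformInfo; the richer model proposed in this RFC may expose different names, so treat the exact calls as stand-ins, and the 16 KiB fallback as an arbitrary assumption.

#include "llvm/Analysis/TargetTransformInfo.h"

using namespace llvm;

// Rough answer to "how much effective cache is available?", hedged for
// low associativity and for targets that report nothing.
static unsigned effectiveCacheBudget(const TargetTransformInfo &TTI) {
  Optional<unsigned> L1 =
      TTI.getCacheSize(TargetTransformInfo::CacheLevel::L1D);
  unsigned Budget = L1.getValueOr(16 * 1024); // assumed fallback

  // Few ways means conflict misses are more likely, so leave headroom
  // rather than planning to fill the whole cache.
  Optional<unsigned> Ways =
      TTI.getCacheAssociativity(TargetTransformInfo::CacheLevel::L1D);
  if (Ways && *Ways <= 2)
    Budget /= 2;

  // Round the budget down to whole cache lines, which is the granularity
  // that prefetching and streaming decisions care about
  // (getCacheLineSize returns 0 when unknown).
  if (unsigned Line = TTI.getCacheLineSize())
    Budget -= Budget % Line;
  return Budget;
}

A loop transformation could then compare its estimated data footprint against this budget before deciding on a blocking or interleaving strategy.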
Renato Golin via llvm-dev
2018-Nov-05 17:08 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Mon, 5 Nov 2018 at 15:56, David Greene <dag at cray.com> wrote:
> The cache interfaces are flexible enough to allow passes to answer
> questions like, "How much effective cache is available for this core
> (thread, etc.)?" That's a critical question when reasoning about the
> thrashing behavior you mentioned above.
>
> Knowing the cache line size is important for prefetching and for
> various other memory operations, such as streaming.
>
> Knowing the number of ways allows one to guesstimate which memory
> accesses are likely to collide in the cache.
>
> It also happens that all of these parameters are useful for simulation
> purposes, which may help projects like llvm-mca.

I see. So, if I get it right, this would initially consolidate the prefetching infrastructure, which is a worthy goal in itself and would only need a minimalist implementation for now.

Later, vectorisers could use that info, for example, to work out how far it would be beneficial to unroll vectorised loops (where the total access size should be a multiple of the cache line), and so on.

Ultimately, simulations would be an interesting use of it, but they shouldn't be a driving force for additional features bundled into the initial design.

> I'm not quite grasping this. Are you saying that a particular
> subtarget may have multiple "clusters" of big.LITTLE cores and that
> each cluster may look different from the others?

Yes. "big.LITTLE" [1] is a marketing name and can cover a bunch of different scenarios. For example:

- A set of big+little core pairs, each seen by the kernel as a single core but actually being two separate cores, scheduled by the kernel via frequency scaling.
- Two entirely separate clusters, flipped between all-big and all-little.
- A heterogeneous mix, which could have different numbers of big and little cores with no need for cache coherence between them. Junos have two little and four big cores; Tegras have one little and four big. There are also other designs with dozens of huge cores plus a tiny core for management purposes.

It's worse than that, though, because different releases of the same family can have different core counts or change model (clustered/bundled/heterogeneous), and there is currently no way to represent that in TableGen.

Given that the kernel has such a strong influence on how those cores get scheduled and preempted, I don't think there's any hope of the compiler doing a good job of predicting usage, or of having any real impact amid higher-level latencies such as context switches and system calls.

--
cheers,
--renato

[1] https://en.wikipedia.org/wiki/ARM_big.LITTLE
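A minimal sketch of the unrolling example above: choose an interleave factor so that the bytes touched per iteration of the vector loop are a whole multiple of the cache line. The function and its inputs are hypothetical (this is not the actual vectoriser interface), and it assumes power-of-two sizes, which holds for every cache and vector width in common use.

// E.g. a 64-byte line with 16-byte (128-bit) vectors yields 4: four
// vector accesses per iteration exactly cover one cache line.
unsigned interleaveForLineMultiple(unsigned CacheLineBytes,
                                   unsigned VectorBytes) {
  if (VectorBytes == 0 || VectorBytes >= CacheLineBytes)
    return 1; // one access already covers a full line, or size unknown
  return CacheLineBytes / VectorBytes; // power-of-two sizes assumed
}

With scalable vectors the ratio would have to be computed from the runtime vector length, which is one reason a queryable model beats hard-coded constants.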