Michael Kruse via llvm-dev
2018-Nov-01 17:30 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Hi, thank you for sharing the system hierarchy model. IMHO it makes a lot of sense, although I don't know which of today's passes would make use of it. Here are my remarks.

I am wondering how one could model the following features using this model, or whether they should be part of a performance model at all:

* ARM's big.LITTLE
* NUMA hierarchies (are the NUMA domains 'caches'?)
* Total available RAM
* remote memory (e.g. RAM on an accelerator mapped into the address space)
* scratch pad
* write-back / write-through write buffers
* page size
* TLB capacity
* constructive/destructive interference (https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size)
  Some architectures have instructions to zero entire cache lines, e.g. dcbz on PowerPC, but it requires the assumed cache line size to be correct. Also see https://www.mono-project.com/news/2016/09/12/arm64-icache/
* Instruction cache

On Tue, Oct 30, 2018 at 15:27, David Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> class TargetCacheLevelInfo {
>   /// getWays - Return the number of ways.
>   ///
>   unsigned getWays() const;

That is, associativity?

Bandwidth might be a useful addition, e.g. if a performance analysis tool uses the roofline model.

> class TargetSoftwarePrefetcherInfo {
>   /// Should we do software prefetching at all?
>   ///
>   bool isEnabled() const;

isEnabled sounds like something configurable at runtime.

>   /// Provide a general prefetch distance hint.
>   ///
>   unsigned getDistance() const;
>
>   /// Prefetch at least this far ahead.
>   ///
>   unsigned getMinDistance() const;
>
>   /// Prefetch at most this far ahead.
>   ///
>   unsigned getMaxDistance() const;
> };
>
> ``get*Distance`` APIs provide general hints to guide the software
> prefetcher.  The software prefetcher may choose to ignore them.
> getMinDistance and getMaxDistance act as clamps to ensure the software
> prefetcher doesn't do something wholly inappropriate.
>
> Distances are specified in terms of cache lines.
> The current
> ``TargetTransformInfo`` interfaces speak in terms of instructions or
> iterations ahead.  Both can be useful and so we may want to add
> iteration and/or instruction distances to this interface.

Would it make sense to specify a prefetch distance in bytes instead of cache lines? The cache line size might not be known at compile-time (e.g. ARM big.LITTLE), but it might still make sense to do software prefetching.

> class TargetStreamBufferInfo {
>   /// getNumLoadBuffers - Return the number of load buffers available.
>   /// This is the number of simultaneously active independent load
>   /// streams the processor can handle before degrading performance.
>   ///
>   int getNumLoadBuffers() const;
>
>   /// getMaxNumLoadBuffers - Return the maximum number of load
>   /// streams that may be active before shutting off streaming
>   /// entirely.  -1 => no limit.
>   ///
>   int getMaxNumLoadBuffers();
>
>   /// getNumStoreBuffers - Return the effective number of store
>   /// buffers available.  This is the number of simultaneously
>   /// active independent store streams the processor can handle
>   /// before degrading performance.
>   ///
>   int getNumStoreBuffers() const;
>
>   /// getMaxNumStoreBuffers - Return the maximum number of store
>   /// streams that may be active before shutting off streaming
>   /// entirely.  -1 => no limit.
>   ///
>   int getMaxNumStoreBuffers() const;
>
>   /// getNumLoadStoreBuffers - Return the effective number of
>   /// buffers available for streams that both load and store data.
>   /// This is the number of simultaneously active independent
>   /// load-store streams the processor can handle before degrading
>   /// performance.
>   ///
>   int getNumLoadStoreBuffers() const;
>
>   /// getMaxNumLoadStoreBuffers - Return the maximum number of
>   /// load-store streams that may be active before shutting off
>   /// streaming entirely.  -1 => no limit.
>   ///
>   int getMaxNumLoadStoreBuffers() const;
> };
>
> Code uses the ``getMax*Buffers`` APIs to judge whether streaming
> should be done at all.  For example, if the number of available
> streams greatly outweighs the hardware available, it makes little
> sense to do streaming.  Performance will be dominated by the streams
> that don't make use of the hardware and the streams that do make use
> of the hardware may actually perform worse.

What counts as a stream? Some processors may support streams with strides and/or backward streams.

Is there a way to express at which level the number of streams is shared? For instance, a core might be able to track 16 streams, but if 4 threads are running (SMT), each can only use 4.

PowerPC's dcbt/dcbtst instructions allow explicitly telling the hardware which streams it should establish. Do the buffer counts include explicitly and automatically established streams? Do non-stream accesses (e.g. stack accesses) count towards these limits?

> class TargetMemorySystemInfo {
>   const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
>
>   /// getNumLevels - Return the number of cache levels this target has.
>   ///
>   unsigned getNumLevels() const;
>
>   /// Cache level iterators
>   ///
>   cachelevel_iterator cachelevel_begin() const;
>   cachelevel_iterator cachelevel_end() const;

May users of this class assume that a level refers to a specific cache, e.g. getCacheLevel(0) being the L1 cache? Or do they have to search for a cache of a specific size?

> //===--------------------------------------------------------------------===//
> // Stream Buffer Information
> //
> const TargetStreamBufferInfo *getStreamBufferInfo() const;
>
> //===--------------------------------------------------------------------===//
> // Software Prefetcher Information
> //
> const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;

Would it make sense to have one PrefetcherInfo/StreamBuffer per cache level?
Some ISAs have multiple prefetchers/prefetch instructions for different levels.

> class TargetExecutionResourceInfo {
>   /// getContained - Return information about the contained execution
>   /// resource.
>   ///
>   TargetExecutionResourceInfo *getContained() const;
>
>   /// getNumContained - Return the number of contained execution
>   /// resources.
>   ///
>   unsigned getNumContained() const;

Shouldn't the level itself specify how many resources of its kind there are, instead of its parent? This would make TargetExecutionEngineInfo::getNumResources() redundant.

E.g. assume that "Socket" is the outermost resource level. The number of sockets in the system could be returned by its TargetExecutionResourceInfo instead of TargetExecutionEngineInfo::getNumResources().

> };
>
> Each execution resource may *contain* other execution resources.  For
> example, a socket may contain multiple cores and a core may contain
> multiple hardware threads (e.g. SMT contexts).  An execution resource
> may have cache levels associated with it.  If so, that cache level is
> private to the execution resource.  For example the first-level cache
> may be private to a core and shared by the threads within the core,
> the second-level cache may be private to a socket and the third-level
> cache may be shared by all sockets.

Should there be an indicator whether a resource is shared or separate? E.g. SMT threads (and AMD "Modules") share functional units, but cores/sockets do not.

> /// TargetExecutionEngineInfo base class - We assume that the target
> /// defines a static array of TargetExecutionResourceInfo objects that
> /// represent all of the execution resources that the target has.  As
> /// such, we simply have to track a pointer to this array.
> ///
> class TargetExecutionEngineInfo {
> public:
>   typedef ... resource_iterator;
>
>   //===--------------------------------------------------------------------===//
>   // Resource Information
>   //
>
>   /// getResource - Get an execution resource by resource ID.
>   ///
>   const TargetExecutionResourceInfo &getResource(unsigned Resource) const;
>
>   /// getNumResources - Return the number of resources this target has.
>   ///
>   unsigned getNumResources() const;
>
>   /// Resource iterators
>   ///
>   resource_iterator resource_begin() const;
>   resource_iterator resource_end() const;
> };
>
> The target execution engine allows optimizers to make intelligent
> choices for cache optimization in the presence of parallelism, where
> multiple threads may be competing for cache resources.

Do you have examples of what optimizations make use of this information? It sounds like this info is more relevant to the OS scheduler than to the compiler.

> Currently the resource iterators will walk over all resources (cores,
> threads, etc.).  Alternatively, we could say that iterators walk over
> "top level" resources and contained resources must be accessed via
> their containing resources.

Most of the time programs are not compiled for specific system configurations (number of sockets, how many cores your processor has, or how many threads the OS allows the program to run). Meaning this information will usually be unknown at compile-time. What is the intention? Pass the system configuration as a flag to the compiler? Is it only available while JITing?

> Here we see one of the flaws in the model.  Because of the way
> ``Socket``, ``Module`` and ``Thread`` are defined above, we're forced
> to include a ``Module`` level even though it really doesn't make sense
> for our ShyEnigma processor.  A ``Core`` has two ``Thread`` resources,
> a ``Module`` has one ``Core`` resource and a ``Socket`` has eight
> ``Module`` resources.  In reality, a ShyEnigma core has two threads
> and a ShyEnigma socket has eight cores.  At least for this SKU (more
> on that below).

Is this a restriction of TableGen? If the "Module" level is not required, could the SubtargetInfo just return Socket->Thread?
Or is there a global requirement that every architecture has to define the same number of levels?

> An open question is how to handle different SKUs within a subtarget
> family.  We modeled the limited number of SKUs used in our products
> via multiple subtargets, so this wasn't a heavy burden for us, but a
> more robust implementation might allow for multiple ``MemorySystem``
> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
> clear whether that's a good/necessary thing and if it is, how to
> specify it with a compiler switch.  ``-mcpu=shy-enigma
> -some-switch-to-specify-memory-and-execution-models``?  It may very
> well be sufficient to have a general system model that applies
> relatively well over multiple SKUs.

Adding more specific subtargets with more refined execution models seems fine to me. But is it reasonable to manage a database of all processors ever produced in the compiler?

Michael
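The question above about where the resource count should live can be made concrete with a small sketch. This is hypothetical illustration code, not the proposed LLVM classes: each resource records what it contains and how many, and walking the containment chain yields the total number of hardware threads.

```cpp
#include <cstddef>

// Hypothetical stand-in for TargetExecutionResourceInfo: each resource
// points at the kind of resource it contains and says how many of them
// it holds (e.g. a socket contains 8 cores, a core contains 2 threads).
struct ResourceInfo {
  const ResourceInfo *Contained; // nullptr at the leaf (hardware thread)
  unsigned NumContained;         // how many of Contained this resource holds
};

// Multiply the counts down the containment chain to get the number of
// leaf resources (hardware threads) under a given resource.
unsigned countLeaves(const ResourceInfo &R) {
  if (!R.Contained)
    return 1; // a leaf resource is a single hardware thread
  return R.NumContained * countLeaves(*R.Contained);
}
```

With the ShyEnigma numbers from the RFC (a socket of eight cores, each core with two threads), countLeaves on the socket would give 16.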
David Greene via llvm-dev
2018-Nov-01 20:21 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Michael, thank you for commenting! Responses inline. Let's continue discussing and if this seems like a reasonable way to proceed, I can start posting patches for review.

                        -David

Michael Kruse <llvmdev at meinersbur.de> writes:

> thank you for sharing the system hierarchy model. IMHO it makes a lot
> of sense, although I don't know which of today's passes would make use
> of it. Here are my remarks.

LoopDataPrefetch would use it via the existing TTI interfaces, but I think that's about it for now. It's a bit of a chicken-and-egg, in that passes won't use it if it's not there and there's no push to get it in because few things use it. :)

> I am wondering how one could model the following features using this
> model, or whether they should be part of a performance model at all:
>
> * ARM's big.LITTLE

How is this modeled in the current AArch64 .td files? The current design doesn't capture heterogeneity at all, not because we're not interested but simply because our compiler captures that at a higher level outside of LLVM.

> * NUMA hierarchies (are the NUMA domains 'caches'?)
>
> * Total available RAM
>
> * remote memory (e.g. RAM on an accelerator mapped into the address space)
>
> * scratch pad

I expect we would expand TargetMemorySystemInfo to hold different kinds of memory-related things. Each of these could be a memory resource. Or maybe we would want something that lives "next to" TargetMemorySystemInfo.

> * write-back / write-through write buffers

Do you mean for caches, or something else?

> * page size
>
> * TLB capacity
>
> * constructive/destructive interference
> (https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size)
> Some architectures have instructions to zero entire cache lines,
> e.g. dcbz on PowerPC, but it requires the assumed cache line size to be correct.
> Also see https://www.mono-project.com/news/2016/09/12/arm64-icache/
>
> * Instruction cache

These could go into TargetMemorySystemInfo, I think.

> On Tue, Oct 30, 2018 at 15:27, David Greene via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>> class TargetCacheLevelInfo {
>>   /// getWays - Return the number of ways.
>>   ///
>>   unsigned getWays() const;
>
> That is, associativity?

Yes. Naming is certainly flexible.

> Bandwidth might be a useful addition, e.g. if a performance analysis
> tool uses the roofline model.

Yes.

>> class TargetSoftwarePrefetcherInfo {
>>   /// Should we do software prefetching at all?
>>   ///
>>   bool isEnabled() const;
>
> isEnabled sounds like something configurable at runtime.

Currently we use it to allow some subtargets to do software prefetching and prevent it for others. I see how the name could be confusing though. Maybe ShouldDoPrefetching?

>> ``get*Distance`` APIs provide general hints to guide the software
>> prefetcher.  The software prefetcher may choose to ignore them.
>> getMinDistance and getMaxDistance act as clamps to ensure the software
>> prefetcher doesn't do something wholly inappropriate.
>>
>> Distances are specified in terms of cache lines.  The current
>> ``TargetTransformInfo`` interfaces speak in terms of instructions or
>> iterations ahead.  Both can be useful and so we may want to add
>> iteration and/or instruction distances to this interface.
>
> Would it make sense to specify a prefetch distance in bytes instead of
> cache lines? The cache line size might not be known at compile-time (e.g.
> ARM big.LITTLE), but it might still make sense to do software
> prefetching.

Sure, I think that would make sense.

>> Code uses the ``getMax*Buffers`` APIs to judge whether streaming
>> should be done at all.  For example, if the number of available
>> streams greatly outweighs the hardware available, it makes little
>> sense to do streaming.  Performance will be dominated by the streams
>> that don't make use of the hardware and the streams that do make use
>> of the hardware may actually perform worse.
>
> What counts as a stream?
> Some processors may support streams with
> strides and/or backward streams.

Yes. We may want some additional information here to describe the hardware's capability.

> Is there a way to express at which level the number of streams is shared? For
> instance, a core might be able to track 16 streams, but if 4 threads
> are running (SMT), each can only use 4.

I suppose we could couple the streaming information to an execution resource, similar to what is done with cache levels, to express this kind of sharing. We haven't found a need for it but that doesn't mean it wouldn't be useful for other/new targets.

> PowerPC's dcbt/dcbtst instructions allow explicitly telling the
> hardware which streams it should establish. Do the buffer counts
> include explicitly and automatically established streams? Do
> non-stream accesses (e.g. stack accesses) count towards these limits?

It's up to the target maintainer to decide what the numbers mean. Obviously passes have to have some notion of what things mean. The thing that establishes what a "stream" is in the user program lives outside of the system model. It may or may not consider random stack accesses as part of a stream.

This is definitely an area for exploration. Since we only have machines with two major targets, we didn't need to contend with more exotic things. :)

>> class TargetMemorySystemInfo {
>>   const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
>>
>>   /// getNumLevels - Return the number of cache levels this target has.
>>   ///
>>   unsigned getNumLevels() const;
>>
>>   /// Cache level iterators
>>   ///
>>   cachelevel_iterator cachelevel_begin() const;
>>   cachelevel_iterator cachelevel_end() const;
>
> May users of this class assume that a level refers to a specific
> cache, e.g. getCacheLevel(0) being the L1 cache?
> Or do they have to
> search for a cache of a specific size?

The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is the L2 cache and so on.

>> //===--------------------------------------------------------------------===//
>> // Stream Buffer Information
>> //
>> const TargetStreamBufferInfo *getStreamBufferInfo() const;
>>
>> //===--------------------------------------------------------------------===//
>> // Software Prefetcher Information
>> //
>> const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
>
> Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
> level? Some ISAs have multiple prefetchers/prefetch instructions
> for different levels.

Probably. Most X86 implementations direct all data prefetches to the same cache level so we didn't find a need to model this, but it makes sense to allow for it.
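To make the indexing convention concrete, here is a minimal sketch of a consumer that relies on getCacheLevel(0) being L1. The types are a mock of the proposed interface, not the real classes, and the sizes used in the example are made up:

```cpp
#include <vector>

// Mock of the proposed TargetMemorySystemInfo, with the convention that
// index 0 is the L1 cache, index 1 the L2 cache, and so on.
struct CacheLevelInfo {
  unsigned SizeBytes;
  unsigned Ways;
};

struct MemorySystemInfo {
  std::vector<CacheLevelInfo> Levels; // Levels[0] is L1
  unsigned getNumLevels() const { return (unsigned)Levels.size(); }
  const CacheLevelInfo &getCacheLevel(unsigned L) const { return Levels[L]; }
};

// Return the innermost (fastest) level whose capacity can hold a given
// working set, or -1 if it fits in no cache level.
int innermostLevelFitting(const MemorySystemInfo &MSI, unsigned Bytes) {
  for (unsigned L = 0; L < MSI.getNumLevels(); ++L)
    if (MSI.getCacheLevel(L).SizeBytes >= Bytes)
      return (int)L;
  return -1;
}
```

A pass like this only works if the level-to-index mapping is guaranteed, which is exactly why the question matters.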
For example the first-level cache >> may be private to a core and shared by the threads within the core, >> the second-level cache may be private to a socket and the third-level >> cache may be shared by all sockets. > > Should there be an indicator whether a resource is shared or separate. > E.g. SMT threads (and AMD "Modules") share functional units, but > cores/sockets do not.Interesting idea. I suppose we could model that with another resource type similar to the way caches are handled. Then the resources could be coupled to execution resources to express the sharing. We hadn't found a need for this level of detail in the work we've done but it could be useful for lots of things.>> /// TargetExecutionEngineInfo base class - We assume that the target >> /// defines a static array of TargetExecutionResourceInfo objects that >> /// represent all of the execution resources that the target has. As >> /// such, we simply have to track a pointer to this array. >> /// >> class TargetExecutionEngineInfo { >> public: >> typedef ... resource_iterator; >> >> //===--------------------------------------------------------------------===// >> // Resource Information >> // >> >> /// getResource - Get an execution resource by resource ID. >> /// >> const TargetExecutionResourceInfo &getResource(unsigned Resource) const; >> >> /// getNumResources - Return the number of resources this target has. >> /// >> unsigned getNumResources() const; >> >> /// Resource iterators >> /// >> resource_iterator resource_begin() const; >> resource_iterator resource_end() const; >> }; >> >> The target execution engine allows optimizers to make intelligent >> choices for cache optimization in the presence of parallelism, where >> multiple threads may be competing for cache resources. > > Do you have examples on what optimizations make use of this > information? It sounds like this info is relevant to the OS scheduler > than the compiler.Sure. Cache blocking is one. 
Let's assume an L2 cache shared among cores. Let's also assume the program is going to use threads within a core. You wouldn't want the compiler to cache block assuming the whole size of L2; you'd want to cache block for some partition of L2 given the execution resources the code is going to use.

>> Currently the resource iterators will walk over all resources (cores,
>> threads, etc.).  Alternatively, we could say that iterators walk over
>> "top level" resources and contained resources must be accessed via
>> their containing resources.
>
> Most of the time programs are not compiled for specific system
> configurations (number of sockets, how many cores your processor has,
> or how many threads the OS allows the program to run). Meaning this
> information will usually be unknown at compile-time.
> What is the intention? Pass the system configuration as a flag to the
> compiler? Is it only available while JITing?

On our machines it is very common for customers to compile for specific system configurations and we provide pre-canned compiler configurations to make it convenient to do so. Every 1% speedup matters in HPC. :) This certainly could be used in a JIT but that wasn't the motivation for the design.

>> Here we see one of the flaws in the model.  Because of the way
>> ``Socket``, ``Module`` and ``Thread`` are defined above, we're forced
>> to include a ``Module`` level even though it really doesn't make sense
>> for our ShyEnigma processor.  A ``Core`` has two ``Thread`` resources,
>> a ``Module`` has one ``Core`` resource and a ``Socket`` has eight
>> ``Module`` resources.  In reality, a ShyEnigma core has two threads
>> and a ShyEnigma socket has eight cores.  At least for this SKU (more
>> on that below).
>
> Is this a restriction of TableGen? If the "Module" level is not
> required, could the SubtargetInfo just return Socket->Thread?
> Or is
> there a global requirement that every architecture has to define the
> same number of levels?

No, the number of levels isn't fixed. The issue is the way that Socket is defined:

  class Module<int numcores> : ExecutionResource<"Module", "Core", numcores>;
  class Socket<int nummodules> : ExecutionResource<"Socket", "Module", nummodules>;

It refers to "Module" by name. The TableGen backend picks up on this and connects the resources appropriately. This is definitely something that will need work as patches are developed. It's possible that your idea of e.g. shared function units above could capture this.

>> An open question is how to handle different SKUs within a subtarget
>> family.  We modeled the limited number of SKUs used in our products
>> via multiple subtargets, so this wasn't a heavy burden for us, but a
>> more robust implementation might allow for multiple ``MemorySystem``
>> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
>> clear whether that's a good/necessary thing and if it is, how to
>> specify it with a compiler switch.  ``-mcpu=shy-enigma
>> -some-switch-to-specify-memory-and-execution-models``?  It may very
>> well be sufficient to have a general system model that applies
>> relatively well over multiple SKUs.
>
> Adding more specific subtargets with more refined execution models
> seems fine to me.
> But is it reasonable to manage a database of all processors ever
> produced in the compiler?

No, it is not. :) That's why this is an open question. We've found it perfectly adequate to define a single model for each major processor generation, but as I said we use a limited number of SKUs. We will need input from the community on this.
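The cache-blocking reasoning above can be sketched like this. The types and numbers are purely illustrative; in the actual proposal the sharing factor would be derived from the execution-resource/cache association rather than stored directly:

```cpp
// Sketch: when a cache level is shared by several hardware threads the
// code will actually use, block for a partition of the cache, not its
// whole capacity.
struct SharedCache {
  unsigned SizeBytes;      // total capacity of the level (e.g. L2)
  unsigned ThreadsSharing; // execution resources competing for it
};

// Effective per-thread budget a cache-blocking transform should target.
unsigned blockingBudget(const SharedCache &C) {
  return C.SizeBytes / C.ThreadsSharing;
}
```

A loop transform would then pick tile sizes so the working set fits in blockingBudget rather than in SizeBytes.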
Michael Kruse via llvm-dev
2018-Nov-01 21:36 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Thu, Nov 1, 2018 at 15:21, David Greene <dag at cray.com> wrote:

> > thank you for sharing the system hierarchy model. IMHO it makes a lot
> > of sense, although I don't know which of today's passes would make use
> > of it. Here are my remarks.
>
> LoopDataPrefetch would use it via the existing TTI interfaces, but I
> think that's about it for now.  It's a bit of a chicken-and-egg, in that
> passes won't use it if it's not there and there's no push to get it in
> because few things use it. :)

What kind of passes use it in the Cray compiler?

> > I am wondering how one could model the following features using this
> > model, or whether they should be part of a performance model at all:
> >
> > * ARM's big.LITTLE
>
> How is this modeled in the current AArch64 .td files?  The current
> design doesn't capture heterogeneity at all, not because we're not
> interested but simply because our compiler captures that at a higher
> level outside of LLVM.

AFAIK it is not handled at all. Any architecture that supports big.LITTLE will return 0 on getCacheLineSize(). See AArch64Subtarget::initializeProperties().

> > * write-back / write-through write buffers
>
> Do you mean for caches, or something else?

https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies
Basically, with write-through, every store is a non-temporal store (or temporal stores are a write-through, depending on how you view it).

> >> class TargetSoftwarePrefetcherInfo {
> >>   /// Should we do software prefetching at all?
> >>   ///
> >>   bool isEnabled() const;
> >
> > isEnabled sounds like something configurable at runtime.
>
> Currently we use it to allow some subtargets to do software prefetching
> and prevent it for others.  I see how the name could be confusing
> though.  Maybe ShouldDoPrefetching?

isPrefetchingProfitable()? If it is a hardware property: isSupported() (i.e.
the prefetch instruction would be a no-op on other implementations).

> > Is there a way to express at which level the number of streams is shared? For
> > instance, a core might be able to track 16 streams, but if 4 threads
> > are running (SMT), each can only use 4.
>
> I suppose we could couple the streaming information to an execution
> resource, similar to what is done with cache levels, to express this kind
> of sharing.  We haven't found a need for it but that doesn't mean it
> wouldn't be useful for other/new targets.

The example above is IBM's Blue Gene/Q processor, so yes, such targets do exist.

> > PowerPC's dcbt/dcbtst instructions allow explicitly telling the
> > hardware which streams it should establish. Do the buffer counts
> > include explicitly and automatically established streams? Do
> > non-stream accesses (e.g. stack accesses) count towards these limits?
>
> It's up to the target maintainer to decide what the numbers mean.
> Obviously passes have to have some notion of what things mean.  The
> thing that establishes what a "stream" is in the user program lives
> outside of the system model.  It may or may not consider random stack
> accesses as part of a stream.
>
> This is definitely an area for exploration.  Since we only have machines
> with two major targets, we didn't need to contend with more exotic
> things. :)

IMHO it would be good if passes and targets agreed on an interpretation of this number when designing the interface.

Again, from the Blue Gene/Q: what counts as a stream is configurable at runtime via a hardware register. It supports 3 settings:

* Interpret every memory access as the start of a stream
* Establish a stream when there are 2 consecutive cache misses
* Only establish streams via dcbt instructions

> >> class TargetMemorySystemInfo {
> >>   const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
> >>
> >>   /// getNumLevels - Return the number of cache levels this target has.
> >>   ///
> >>   unsigned getNumLevels() const;
> >>
> >>   /// Cache level iterators
> >>   ///
> >>   cachelevel_iterator cachelevel_begin() const;
> >>   cachelevel_iterator cachelevel_end() const;
> >
> > May users of this class assume that a level refers to a specific
> > cache, e.g. getCacheLevel(0) being the L1 cache? Or do they have to
> > search for a cache of a specific size?
>
> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> the L2 cache and so on.

Can passes rely on it?

> >> //===--------------------------------------------------------------------===//
> >> // Stream Buffer Information
> >> //
> >> const TargetStreamBufferInfo *getStreamBufferInfo() const;
> >>
> >> //===--------------------------------------------------------------------===//
> >> // Software Prefetcher Information
> >> //
> >> const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
> >
> > Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
> > level? Some ISAs have multiple prefetchers/prefetch instructions
> > for different levels.
>
> Probably.  Most X86 implementations direct all data prefetches to the
> same cache level so we didn't find a need to model this, but it makes
> sense to allow for it.

Again the Blue Gene/Q: streams prefetch into the L1P cache (P for prefetch), but a dcbt instruction is necessary to establish the cache line in the L1 cache.

> >> An open question is how to handle different SKUs within a subtarget
> >> family.  We modeled the limited number of SKUs used in our products
> >> via multiple subtargets, so this wasn't a heavy burden for us, but a
> >> more robust implementation might allow for multiple ``MemorySystem``
> >> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
> >> clear whether that's a good/necessary thing and if it is, how to
> >> specify it with a compiler switch.  ``-mcpu=shy-enigma
> >> -some-switch-to-specify-memory-and-execution-models``?
> >> It may very
> >> well be sufficient to have a general system model that applies
> >> relatively well over multiple SKUs.
> >
> > Adding more specific subtargets with more refined execution models
> > seems fine to me.
> > But is it reasonable to manage a database of all processors ever
> > produced in the compiler?
>
> No, it is not. :) That's why this is an open question.  We've found it
> perfectly adequate to define a single model for each major processor
> generation, but as I said we use a limited number of SKUs.  We will
> need input from the community on this.

Independently of whether subtargets for SKUs are added, could we (also) define these parameters via the command line, like xlc's -qcache option?

Michael
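As a footnote to the streaming discussion earlier in the thread, the judgement the RFC describes for the ``getMax*Buffers`` APIs might look roughly like this. The type is a hypothetical stand-in; the -1-means-no-limit convention comes from the RFC, while the threshold policy itself is invented for illustration:

```cpp
// Hypothetical stand-in for TargetStreamBufferInfo. Per the RFC, the
// "max" value is the point past which the hardware shuts off streaming
// entirely; -1 means there is no such limit.
struct StreamBufferInfo {
  int NumLoadBuffers;    // effective count before performance degrades
  int MaxNumLoadBuffers; // hard limit; -1 => no limit
};

// Decide whether a loop with the given number of independent load
// streams should use streaming at all: if demand exceeds the hard
// limit, the streams that get no hardware dominate performance and
// streaming is not worth it.
bool shouldStream(const StreamBufferInfo &SBI, int LoopStreams) {
  if (SBI.MaxNumLoadBuffers < 0)
    return true; // no limit
  return LoopStreams <= SBI.MaxNumLoadBuffers;
}
```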