Michael Kruse via llvm-dev
2018-Nov-01 17:30 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Hi, thank you for sharing the system hierarchy model. IMHO it makes a lot of sense, although I don't know which of today's passes would make use of it. Here are my remarks.

I am wondering how one could model the following features using this model, or whether they should be part of a performance model at all:

* ARM's big.LITTLE
* NUMA hierarchies (are the NUMA domains 'caches'?)
* Total available RAM
* remote memory (e.g. RAM on an accelerator mapped into the address space)
* scratch pad
* write-back / write-through write buffers
* page size
* TLB capacity
* constructive/destructive interference (https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size)
  Some architectures have instructions to zero entire cache lines, e.g. dcbz on PowerPC, but it requires the assumed cache line size to be correct. Also see https://www.mono-project.com/news/2016/09/12/arm64-icache/
* Instruction cache

On Tue, Oct 30, 2018 at 15:27, David Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> class TargetCacheLevelInfo {
>   /// getWays - Return the number of ways.
>   ///
>   unsigned getWays() const;

That is, associativity?

Bandwidth might be a useful addition, e.g. if a performance analysis tool uses the roofline model.

> class TargetSoftwarePrefetcherInfo {
>   /// Should we do software prefetching at all?
>   ///
>   bool isEnabled() const;

isEnabled sounds like something configurable at runtime.

>   /// Provide a general prefetch distance hint.
>   ///
>   unsigned getDistance() const;
>
>   /// Prefetch at least this far ahead.
>   ///
>   unsigned getMinDistance() const;
>
>   /// Prefetch at most this far ahead.
>   ///
>   unsigned getMaxDistance() const;
> };
>
> ``get*Distance`` APIs provide general hints to guide the software
> prefetcher.  The software prefetcher may choose to ignore them.
> getMinDistance and getMaxDistance act as clamps to ensure the software
> prefetcher doesn't do something wholly inappropriate.
>
> Distances are specified in terms of cache lines.
> The current
> ``TargetTransformInfo`` interfaces speak in terms of instructions or
> iterations ahead.  Both can be useful and so we may want to add
> iteration and/or instruction distances to this interface.

Would it make sense to specify a prefetch distance in bytes instead of cache lines? The cache line size might not be known at compile-time (e.g. ARM big.LITTLE), but it might still make sense to do software prefetching.

> class TargetStreamBufferInfo {
>   /// getNumLoadBuffers - Return the number of load buffers available.
>   /// This is the number of simultaneously active independent load
>   /// streams the processor can handle before degrading performance.
>   ///
>   int getNumLoadBuffers() const;
>
>   /// getMaxNumLoadBuffers - Return the maximum number of load
>   /// streams that may be active before shutting off streaming
>   /// entirely.  -1 => no limit.
>   ///
>   int getMaxNumLoadBuffers();
>
>   /// getNumStoreBuffers - Return the effective number of store
>   /// buffers available.  This is the number of simultaneously
>   /// active independent store streams the processor can handle
>   /// before degrading performance.
>   ///
>   int getNumStoreBuffers() const;
>
>   /// getMaxNumStoreBuffers - Return the maximum number of store
>   /// streams that may be active before shutting off streaming
>   /// entirely.  -1 => no limit.
>   ///
>   int getMaxNumStoreBuffers() const;
>
>   /// getNumLoadStoreBuffers - Return the effective number of
>   /// buffers available for streams that both load and store data.
>   /// This is the number of simultaneously active independent
>   /// load-store streams the processor can handle before degrading
>   /// performance.
>   ///
>   int getNumLoadStoreBuffers() const;
>
>   /// getMaxNumLoadStoreBuffers - Return the maximum number of
>   /// load-store streams that may be active before shutting off
>   /// streaming entirely.  -1 => no limit.
>   ///
>   int getMaxNumLoadStoreBuffers() const;
> };
>
> Code uses the ``getMax*Buffers`` APIs to judge whether streaming
> should be done at all.  For example, if the number of available
> streams greatly outweighs the hardware available, it makes little
> sense to do streaming.  Performance will be dominated by the streams
> that don't make use of the hardware and the streams that do make use
> of the hardware may actually perform worse.

What counts as a stream? Some processors may support streams with strides and/or backward streams.

Is there a way to express at which level the number of streams is shared? For instance, a core might be able to track 16 streams, but if 4 threads are running (SMT), each can only use 4.

PowerPC's dcbt/dcbtst instructions allow explicitly telling the hardware which streams it should establish. Do the buffer counts include explicitly and automatically established streams? Do non-stream accesses (e.g. stack accesses) count towards these limits?

> class TargetMemorySystemInfo {
>   const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
>
>   /// getNumLevels - Return the number of cache levels this target has.
>   ///
>   unsigned getNumLevels() const;
>
>   /// Cache level iterators
>   ///
>   cachelevel_iterator cachelevel_begin() const;
>   cachelevel_iterator cachelevel_end() const;

May users of this class assume that a level refers to a specific cache, e.g. getCacheLevel(0) being the L1 cache? Or do they have to search for a cache of a specific size?

> //===--------------------------------------------------------------------===//
> // Stream Buffer Information
> //
> const TargetStreamBufferInfo *getStreamBufferInfo() const;
>
> //===--------------------------------------------------------------------===//
> // Software Prefetcher Information
> //
> const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;

Would it make sense to have one PrefetcherInfo/StreamBuffer per cache level?
Some ISAs have multiple prefetchers/prefetch instructions for different levels.

> class TargetExecutionResourceInfo {
>   /// getContained - Return information about the contained execution
>   /// resource.
>   ///
>   TargetExecutionResourceInfo *getContained() const;
>
>   /// getNumContained - Return the number of contained execution
>   /// resources.
>   ///
>   unsigned getNumContained() const;

Shouldn't the level itself specify how many resources of its kind there are, instead of its parent? This would make TargetExecutionEngineInfo::getNumResources() redundant.

E.g. assume that "Socket" is the outermost resource level. The number of sockets in the system could be returned by its TargetExecutionResourceInfo instead of TargetExecutionEngineInfo::getNumResources().

> };
>
> Each execution resource may *contain* other execution resources.  For
> example, a socket may contain multiple cores and a core may contain
> multiple hardware threads (e.g. SMT contexts).  An execution resource
> may have cache levels associated with it.  If so, that cache level is
> private to the execution resource.  For example the first-level cache
> may be private to a core and shared by the threads within the core,
> the second-level cache may be private to a socket and the third-level
> cache may be shared by all sockets.

Should there be an indicator whether a resource is shared or separate? E.g. SMT threads (and AMD "Modules") share functional units, but cores/sockets do not.

> /// TargetExecutionEngineInfo base class - We assume that the target
> /// defines a static array of TargetExecutionResourceInfo objects that
> /// represent all of the execution resources that the target has.  As
> /// such, we simply have to track a pointer to this array.
> ///
> class TargetExecutionEngineInfo {
> public:
>   typedef ... resource_iterator;
>
>   //===--------------------------------------------------------------------===//
>   // Resource Information
>   //
>
>   /// getResource - Get an execution resource by resource ID.
>   ///
>   const TargetExecutionResourceInfo &getResource(unsigned Resource) const;
>
>   /// getNumResources - Return the number of resources this target has.
>   ///
>   unsigned getNumResources() const;
>
>   /// Resource iterators
>   ///
>   resource_iterator resource_begin() const;
>   resource_iterator resource_end() const;
> };
>
> The target execution engine allows optimizers to make intelligent
> choices for cache optimization in the presence of parallelism, where
> multiple threads may be competing for cache resources.

Do you have examples of what optimizations make use of this information? It sounds like this info is more relevant to the OS scheduler than to the compiler.

> Currently the resource iterators will walk over all resources (cores,
> threads, etc.).  Alternatively, we could say that iterators walk over
> "top level" resources and contained resources must be accessed via
> their containing resources.

Most of the time programs are not compiled for specific system configurations (number of sockets, how many cores your processor has, or how many threads the OS allows the program to run). Meaning this information will usually be unknown at compile-time. What is the intention? Pass the system configuration as a flag to the compiler? Is it only available while JITing?

> Here we see one of the flaws in the model.  Because of the way
> ``Socket``, ``Module`` and ``Thread`` are defined above, we're forced
> to include a ``Module`` level even though it really doesn't make sense
> for our ShyEnigma processor.  A ``Core`` has two ``Thread`` resources,
> a ``Module`` has one ``Core`` resource and a ``Socket`` has eight
> ``Module`` resources.  In reality, a ShyEnigma core has two threads
> and a ShyEnigma socket has eight cores.  At least for this SKU (more
> on that below).

Is this a restriction of TableGen? If the "Module" level is not required, could the SubtargetInfo just return Socket->Thread?
Or is there a global requirement that every architecture has to define the same number of levels?

> An open question is how to handle different SKUs within a subtarget
> family.  We modeled the limited number of SKUs used in our products
> via multiple subtargets, so this wasn't a heavy burden for us, but a
> more robust implementation might allow for multiple ``MemorySystem``
> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
> clear whether that's a good/necessary thing and if it is, how to
> specify it with a compiler switch.  ``-mcpu=shy-enigma
> -some-switch-to-specify-memory-and-execution-models``?  It may very
> well be sufficient to have a general system model that applies
> relatively well over multiple SKUs.

Adding more specific subtargets with more refined execution models seems fine to me. But is it reasonable to manage a database of all processors ever produced in the compiler?

Michael
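The question above about where the resource count should live can be made concrete with a small sketch. This is hypothetical illustration code, not the proposed LLVM classes: each resource records what it contains and how many, and walking the containment chain yields the total number of hardware threads.

```cpp
#include <cstddef>

// Hypothetical stand-in for TargetExecutionResourceInfo: each resource
// points at the kind of resource it contains and says how many of them
// it holds (e.g. a socket contains 8 cores, a core contains 2 threads).
struct ResourceInfo {
  const ResourceInfo *Contained; // nullptr at the leaf (hardware thread)
  unsigned NumContained;         // how many of Contained this resource holds
};

// Multiply the counts down the containment chain to get the number of
// leaf resources (hardware threads) under a given resource.
unsigned countLeaves(const ResourceInfo &R) {
  if (!R.Contained)
    return 1; // a leaf resource is a single hardware thread
  return R.NumContained * countLeaves(*R.Contained);
}
```

With the ShyEnigma numbers from the RFC (a socket of eight cores, each core with two threads), countLeaves on the socket would give 16.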
David Greene via llvm-dev
2018-Nov-01 20:21 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Michael, thank you for commenting! Responses inline. Let's continue discussing and if this seems like a reasonable way to proceed, I can start posting patches for review.

                        -David

Michael Kruse <llvmdev at meinersbur.de> writes:

> thank you for sharing the system hierarchy model. IMHO it makes a lot
> of sense, although I don't know which of today's passes would make use
> of it. Here are my remarks.

LoopDataPrefetch would use it via the existing TTI interfaces, but I think that's about it for now. It's a bit of a chicken-and-egg, in that passes won't use it if it's not there and there's no push to get it in because few things use it. :)

> I am wondering how one could model the following features using this
> model, or whether they should be part of a performance model at all:
>
> * ARM's big.LITTLE

How is this modeled in the current AArch64 .td files? The current design doesn't capture heterogeneity at all, not because we're not interested but simply because our compiler captures that at a higher level outside of LLVM.

> * NUMA hierarchies (are the NUMA domains 'caches'?)
>
> * Total available RAM
>
> * remote memory (e.g. RAM on an accelerator mapped into the address space)
>
> * scratch pad

I expect we would expand TargetMemorySystemInfo to hold different kinds of memory-related things. Each of these could be a memory resource. Or maybe we would want something that lives "next to" TargetMemorySystemInfo.

> * write-back / write-through write buffers

Do you mean for caches, or something else?

> * page size
>
> * TLB capacity
>
> * constructive/destructive interference
> (https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size)
> Some architectures have instructions to zero entire cache lines,
> e.g. dcbz on PowerPC, but it requires the assumed cache line size to be correct.
> Also see https://www.mono-project.com/news/2016/09/12/arm64-icache/
>
> * Instruction cache

These could go into TargetMemorySystemInfo, I think.

> On Tue, Oct 30, 2018 at 15:27, David Greene via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>> class TargetCacheLevelInfo {
>>   /// getWays - Return the number of ways.
>>   ///
>>   unsigned getWays() const;
>
> That is, associativity?

Yes. Naming is certainly flexible.

> Bandwidth might be a useful addition, e.g. if a performance analysis
> tool uses the roofline model.

Yes.

>> class TargetSoftwarePrefetcherInfo {
>>   /// Should we do software prefetching at all?
>>   ///
>>   bool isEnabled() const;
>
> isEnabled sounds like something configurable at runtime.

Currently we use it to allow some subtargets to do software prefetching and prevent it for others. I see how the name could be confusing though. Maybe ShouldDoPrefetching?

>> ``get*Distance`` APIs provide general hints to guide the software
>> prefetcher.  The software prefetcher may choose to ignore them.
>> getMinDistance and getMaxDistance act as clamps to ensure the software
>> prefetcher doesn't do something wholly inappropriate.
>>
>> Distances are specified in terms of cache lines.  The current
>> ``TargetTransformInfo`` interfaces speak in terms of instructions or
>> iterations ahead.  Both can be useful and so we may want to add
>> iteration and/or instruction distances to this interface.
>
> Would it make sense to specify a prefetch distance in bytes instead of
> cache lines? The cache line size might not be known at compile-time (e.g.
> ARM big.LITTLE), but it might still make sense to do software
> prefetching.

Sure, I think that would make sense.

>> Code uses the ``getMax*Buffers`` APIs to judge whether streaming
>> should be done at all.  For example, if the number of available
>> streams greatly outweighs the hardware available, it makes little
>> sense to do streaming.  Performance will be dominated by the streams
>> that don't make use of the hardware and the streams that do make use
>> of the hardware may actually perform worse.
>
> What counts as a stream?
> Some processors may support streams with
> strides and/or backward streams.

Yes. We may want some additional information here to describe the hardware's capability.

> Is there a way to express at which level the number of streams is shared? For
> instance, a core might be able to track 16 streams, but if 4 threads
> are running (SMT), each can only use 4.

I suppose we could couple the streaming information to an execution resource, similar to what is done with cache levels, to express this kind of sharing. We haven't found a need for it but that doesn't mean it wouldn't be useful for other/new targets.

> PowerPC's dcbt/dcbtst instructions allow explicitly telling the
> hardware which streams it should establish. Do the buffer counts
> include explicitly and automatically established streams? Do
> non-stream accesses (e.g. stack accesses) count towards these limits?

It's up to the target maintainer to decide what the numbers mean. Obviously passes have to have some notion of what things mean. The thing that establishes what a "stream" is in the user program lives outside of the system model. It may or may not consider random stack accesses as part of a stream.

This is definitely an area for exploration. Since we only have machines with two major targets, we didn't need to contend with more exotic things. :)

>> class TargetMemorySystemInfo {
>>   const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
>>
>>   /// getNumLevels - Return the number of cache levels this target has.
>>   ///
>>   unsigned getNumLevels() const;
>>
>>   /// Cache level iterators
>>   ///
>>   cachelevel_iterator cachelevel_begin() const;
>>   cachelevel_iterator cachelevel_end() const;
>
> May users of this class assume that a level refers to a specific
> cache, e.g. getCacheLevel(0) being the L1 cache?
> Or do they have to
> search for a cache of a specific size?

The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is the L2 cache and so on.

>> //===--------------------------------------------------------------------===//
>> // Stream Buffer Information
>> //
>> const TargetStreamBufferInfo *getStreamBufferInfo() const;
>>
>> //===--------------------------------------------------------------------===//
>> // Software Prefetcher Information
>> //
>> const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
>
> Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
> level? Some ISAs have multiple prefetchers/prefetch instructions
> for different levels.

Probably. Most X86 implementations direct all data prefetches to the same cache level so we didn't find a need to model this, but it makes sense to allow for it.
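To make the indexing convention concrete, here is a minimal sketch of a consumer that relies on getCacheLevel(0) being L1. The types are a mock of the proposed interface, not the real classes, and the sizes used in the example are made up:

```cpp
#include <vector>

// Mock of the proposed TargetMemorySystemInfo, with the convention that
// index 0 is the L1 cache, index 1 the L2 cache, and so on.
struct CacheLevelInfo {
  unsigned SizeBytes;
  unsigned Ways;
};

struct MemorySystemInfo {
  std::vector<CacheLevelInfo> Levels; // Levels[0] is L1
  unsigned getNumLevels() const { return (unsigned)Levels.size(); }
  const CacheLevelInfo &getCacheLevel(unsigned L) const { return Levels[L]; }
};

// Return the innermost (fastest) level whose capacity can hold a given
// working set, or -1 if it fits in no cache level.
int innermostLevelFitting(const MemorySystemInfo &MSI, unsigned Bytes) {
  for (unsigned L = 0; L < MSI.getNumLevels(); ++L)
    if (MSI.getCacheLevel(L).SizeBytes >= Bytes)
      return (int)L;
  return -1;
}
```

A pass like this only works if the level-to-index mapping is guaranteed, which is exactly why the question matters.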
For example the first-level cache >> may be private to a core and shared by the threads within the core, >> the second-level cache may be private to a socket and the third-level >> cache may be shared by all sockets. > > Should there be an indicator whether a resource is shared or separate. > E.g. SMT threads (and AMD "Modules") share functional units, but > cores/sockets do not.Interesting idea. I suppose we could model that with another resource type similar to the way caches are handled. Then the resources could be coupled to execution resources to express the sharing. We hadn't found a need for this level of detail in the work we've done but it could be useful for lots of things.>> /// TargetExecutionEngineInfo base class - We assume that the target >> /// defines a static array of TargetExecutionResourceInfo objects that >> /// represent all of the execution resources that the target has. As >> /// such, we simply have to track a pointer to this array. >> /// >> class TargetExecutionEngineInfo { >> public: >> typedef ... resource_iterator; >> >> //===--------------------------------------------------------------------===// >> // Resource Information >> // >> >> /// getResource - Get an execution resource by resource ID. >> /// >> const TargetExecutionResourceInfo &getResource(unsigned Resource) const; >> >> /// getNumResources - Return the number of resources this target has. >> /// >> unsigned getNumResources() const; >> >> /// Resource iterators >> /// >> resource_iterator resource_begin() const; >> resource_iterator resource_end() const; >> }; >> >> The target execution engine allows optimizers to make intelligent >> choices for cache optimization in the presence of parallelism, where >> multiple threads may be competing for cache resources. > > Do you have examples on what optimizations make use of this > information? It sounds like this info is relevant to the OS scheduler > than the compiler.Sure. Cache blocking is one. 
Let's assume an L2 cache shared among cores. Let's also assume the program is going to use threads within a core. You wouldn't want the compiler to cache block assuming the whole size of L2; you'd want to cache block for some partition of L2 given the execution resources the code is going to use.

>> Currently the resource iterators will walk over all resources (cores,
>> threads, etc.).  Alternatively, we could say that iterators walk over
>> "top level" resources and contained resources must be accessed via
>> their containing resources.
>
> Most of the time programs are not compiled for specific system
> configurations (number of sockets, how many cores your processor has,
> or how many threads the OS allows the program to run). Meaning this
> information will usually be unknown at compile-time.
> What is the intention? Pass the system configuration as a flag to the
> compiler? Is it only available while JITing?

On our machines it is very common for customers to compile for specific system configurations and we provide pre-canned compiler configurations to make it convenient to do so. Every 1% speedup matters in HPC. :) This certainly could be used in a JIT but that wasn't the motivation for the design.

>> Here we see one of the flaws in the model.  Because of the way
>> ``Socket``, ``Module`` and ``Thread`` are defined above, we're forced
>> to include a ``Module`` level even though it really doesn't make sense
>> for our ShyEnigma processor.  A ``Core`` has two ``Thread`` resources,
>> a ``Module`` has one ``Core`` resource and a ``Socket`` has eight
>> ``Module`` resources.  In reality, a ShyEnigma core has two threads
>> and a ShyEnigma socket has eight cores.  At least for this SKU (more
>> on that below).
>
> Is this a restriction of TableGen? If the "Module" level is not
> required, could the SubtargetInfo just return Socket->Thread?
> Or is
> there a global requirement that every architecture has to define the
> same number of levels?

No, the number of levels isn't fixed. The issue is the way that Socket is defined:

  class Module<int numcores> : ExecutionResource<"Module", "Core", numcores>;
  class Socket<int nummodules> : ExecutionResource<"Socket", "Module", nummodules>;

It refers to "Module" by name. The TableGen backend picks up on this and connects the resources appropriately. This is definitely something that will need work as patches are developed. It's possible that your idea of e.g. shared function units above could capture this.

>> An open question is how to handle different SKUs within a subtarget
>> family.  We modeled the limited number of SKUs used in our products
>> via multiple subtargets, so this wasn't a heavy burden for us, but a
>> more robust implementation might allow for multiple ``MemorySystem``
>> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
>> clear whether that's a good/necessary thing and if it is, how to
>> specify it with a compiler switch.  ``-mcpu=shy-enigma
>> -some-switch-to-specify-memory-and-execution-models``?  It may very
>> well be sufficient to have a general system model that applies
>> relatively well over multiple SKUs.
>
> Adding more specific subtargets with more refined execution models
> seems fine to me.
> But is it reasonable to manage a database of all processors ever
> produced in the compiler?

No, it is not. :) That's why this is an open question. We've found it perfectly adequate to define a single model for each major processor generation, but as I said we use a limited number of SKUs. We will need input from the community on this.
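The cache-blocking reasoning above can be sketched like this. The types and numbers are purely illustrative; in the actual proposal the sharing factor would be derived from the execution-resource/cache association rather than stored directly:

```cpp
// Sketch: when a cache level is shared by several hardware threads the
// code will actually use, block for a partition of the cache, not its
// whole capacity.
struct SharedCache {
  unsigned SizeBytes;      // total capacity of the level (e.g. L2)
  unsigned ThreadsSharing; // execution resources competing for it
};

// Effective per-thread budget a cache-blocking transform should target.
unsigned blockingBudget(const SharedCache &C) {
  return C.SizeBytes / C.ThreadsSharing;
}
```

A loop transform would then pick tile sizes so the working set fits in blockingBudget rather than in SizeBytes.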
Michael Kruse via llvm-dev
2018-Nov-01 21:36 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Thu, Nov 1, 2018 at 15:21, David Greene <dag at cray.com> wrote:

> > thank you for sharing the system hierarchy model. IMHO it makes a lot
> > of sense, although I don't know which of today's passes would make use
> > of it. Here are my remarks.
>
> LoopDataPrefetch would use it via the existing TTI interfaces, but I
> think that's about it for now.  It's a bit of a chicken-and-egg, in that
> passes won't use it if it's not there and there's no push to get it in
> because few things use it. :)

What kind of passes use it in the Cray compiler?

> > I am wondering how one could model the following features using this
> > model, or whether they should be part of a performance model at all:
> >
> > * ARM's big.LITTLE
>
> How is this modeled in the current AArch64 .td files?  The current
> design doesn't capture heterogeneity at all, not because we're not
> interested but simply because our compiler captures that at a higher
> level outside of LLVM.

AFAIK it is not handled at all. Any architecture that supports big.LITTLE will return 0 on getCacheLineSize(). See AArch64Subtarget::initializeProperties().

> > * write-back / write-through write buffers
>
> Do you mean for caches, or something else?

https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies
Basically, with write-through, every store is a non-temporal store (or temporal stores are a write-through, depending on how you view it).

> >> class TargetSoftwarePrefetcherInfo {
> >>   /// Should we do software prefetching at all?
> >>   ///
> >>   bool isEnabled() const;
> >
> > isEnabled sounds like something configurable at runtime.
>
> Currently we use it to allow some subtargets to do software prefetching
> and prevent it for others.  I see how the name could be confusing
> though.  Maybe ShouldDoPrefetching?

isPrefetchingProfitable()? If it is a hardware property: isSupported() (i.e.
the prefetch instruction would be a no-op on other implementations).

> > Is there a way to express at which level the number of streams is shared? For
> > instance, a core might be able to track 16 streams, but if 4 threads
> > are running (SMT), each can only use 4.
>
> I suppose we could couple the streaming information to an execution
> resource, similar to what is done with cache levels, to express this kind
> of sharing.  We haven't found a need for it but that doesn't mean it
> wouldn't be useful for other/new targets.

The example above is IBM's Blue Gene/Q processor, so yes, such targets do exist.

> > PowerPC's dcbt/dcbtst instructions allow explicitly telling the
> > hardware which streams it should establish. Do the buffer counts
> > include explicitly and automatically established streams? Do
> > non-stream accesses (e.g. stack accesses) count towards these limits?
>
> It's up to the target maintainer to decide what the numbers mean.
> Obviously passes have to have some notion of what things mean.  The
> thing that establishes what a "stream" is in the user program lives
> outside of the system model.  It may or may not consider random stack
> accesses as part of a stream.
>
> This is definitely an area for exploration.  Since we only have machines
> with two major targets, we didn't need to contend with more exotic
> things. :)

IMHO it would be good if passes and targets agreed on an interpretation of this number when designing the interface.

Again, from the Blue Gene/Q: what counts as a stream is configurable at runtime via a hardware register. It supports 3 settings:

* Interpret every memory access as the start of a stream
* Establish a stream when there are 2 consecutive cache misses
* Only establish streams via dcbt instructions

> >> class TargetMemorySystemInfo {
> >>   const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
> >>
> >>   /// getNumLevels - Return the number of cache levels this target has.
> >>   ///
> >>   unsigned getNumLevels() const;
> >>
> >>   /// Cache level iterators
> >>   ///
> >>   cachelevel_iterator cachelevel_begin() const;
> >>   cachelevel_iterator cachelevel_end() const;
> >
> > May users of this class assume that a level refers to a specific
> > cache, e.g. getCacheLevel(0) being the L1 cache? Or do they have to
> > search for a cache of a specific size?
>
> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> the L2 cache and so on.

Can passes rely on it?

> >> //===--------------------------------------------------------------------===//
> >> // Stream Buffer Information
> >> //
> >> const TargetStreamBufferInfo *getStreamBufferInfo() const;
> >>
> >> //===--------------------------------------------------------------------===//
> >> // Software Prefetcher Information
> >> //
> >> const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
> >
> > Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
> > level? Some ISAs have multiple prefetchers/prefetch instructions
> > for different levels.
>
> Probably.  Most X86 implementations direct all data prefetches to the
> same cache level so we didn't find a need to model this, but it makes
> sense to allow for it.

Again the Blue Gene/Q: streams prefetch into the L1P cache (P for prefetch), but a dcbt instruction is necessary to establish the cache line in the L1 cache.

> >> An open question is how to handle different SKUs within a subtarget
> >> family.  We modeled the limited number of SKUs used in our products
> >> via multiple subtargets, so this wasn't a heavy burden for us, but a
> >> more robust implementation might allow for multiple ``MemorySystem``
> >> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
> >> clear whether that's a good/necessary thing and if it is, how to
> >> specify it with a compiler switch.  ``-mcpu=shy-enigma
> >> -some-switch-to-specify-memory-and-execution-models``?
> >> It may very
> >> well be sufficient to have a general system model that applies
> >> relatively well over multiple SKUs.
> >
> > Adding more specific subtargets with more refined execution models
> > seems fine to me.
> > But is it reasonable to manage a database of all processors ever
> > produced in the compiler?
>
> No, it is not. :) That's why this is an open question.  We've found it
> perfectly adequate to define a single model for each major processor
> generation, but as I said we use a limited number of SKUs.  We will
> need input from the community on this.

Independently of whether subtargets for SKUs are added, could we (also) define these parameters via the command line, like xlc's -qcache option?

Michael
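As a footnote to the streaming discussion earlier in the thread, the judgement the RFC describes for the ``getMax*Buffers`` APIs might look roughly like this. The type is a hypothetical stand-in; the -1-means-no-limit convention comes from the RFC, while the threshold policy itself is invented for illustration:

```cpp
// Hypothetical stand-in for TargetStreamBufferInfo. Per the RFC, the
// "max" value is the point past which the hardware shuts off streaming
// entirely; -1 means there is no such limit.
struct StreamBufferInfo {
  int NumLoadBuffers;    // effective count before performance degrades
  int MaxNumLoadBuffers; // hard limit; -1 => no limit
};

// Decide whether a loop with the given number of independent load
// streams should use streaming at all: if demand exceeds the hard
// limit, the streams that get no hardware dominate performance and
// streaming is not worth it.
bool shouldStream(const StreamBufferInfo &SBI, int LoopStreams) {
  if (SBI.MaxNumLoadBuffers < 0)
    return true; // no limit
  return LoopStreams <= SBI.MaxNumLoadBuffers;
}
```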