Michael Kruse via llvm-dev
2018-Nov-01 21:36 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Thu, Nov 1, 2018 at 15:21 David Greene <dag at cray.com> wrote:
>
> > thank you for sharing the system hierarchy model. IMHO it makes a lot
> > of sense, although I don't know which of today's passes would make use
> > of it. Here are my remarks.
>
> LoopDataPrefetch would use it via the existing TTI interfaces, but I
> think that's about it for now. It's a bit of a chicken-and-egg, in that
> passes won't use it if it's not there and there's no push to get it in
> because few things use it. :)

What kinds of passes use it in the Cray compiler?

> > I am wondering how one could model the following features using this
> > model, or whether they should be part of a performance model at all:
> >
> > * ARM's big.LITTLE
>
> How is this modeled in the current AArch64 .td files? The current
> design doesn't capture heterogeneity at all, not because we're not
> interested but simply because our compiler captures that at a higher
> level outside of LLVM.

AFAIK it is not handled at all. Any architecture that supports
big.LITTLE will return 0 on getCacheLineSize(). See
AArch64Subtarget::initializeProperties().

> > * write-back / write-through write buffers
>
> Do you mean for caches, or something else?

https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies

Basically, with write-through, every store is a non-temporal store (or
every temporal store behaves like a write-through, depending on how you
view it).

> >> class TargetSoftwarePrefetcherInfo {
> >>   /// Should we do software prefetching at all?
> >>   ///
> >>   bool isEnabled() const;
> >
> > isEnabled sounds like something configurable at runtime.
>
> Currently we use it to allow some subtargets to do software prefetching
> and prevent it for others. I see how the name could be confusing
> though. Maybe ShouldDoPrefetching?

isPrefetchingProfitable()?

If it is a hardware property: isSupported()
(i.e. the prefetch instruction would be a no-op on other implementations)

> > Is there a way to express on which level the number of streams is
> > shared? For instance, a core might be able to track 16 streams, but
> > if 4 threads are running (SMT), each can only use 4.
>
> I suppose we could couple the streaming information to an execution
> resource, similar to what is done with cache levels to express this kind
> of sharing. We haven't found a need for it but that doesn't mean it
> wouldn't be useful for other/new targets.

The example above is IBM's Blue Gene/Q processor, so yes, such targets
do exist.

> > PowerPC's dcbt/dcbtst instruction allows explicitly specifying to the
> > hardware which streams it should establish. Do the buffer counts
> > include explicitly and automatically established streams? Do
> > non-stream accesses (e.g. stack accesses) count towards the limit?
>
> It's up to the target maintainer to decide what the numbers mean.
> Obviously passes have to have some notion of what things mean. The
> thing that establishes what a "stream" is in the user program lives
> outside of the system model. It may or may not consider random stack
> accesses as part of a stream.
>
> This is definitely an area for exploration. Since we only have machines
> with two major targets, we didn't need to contend with more exotic
> things. :)

IMHO it would be good if passes and targets agree on an interpretation
of this number when designing the interface.

Again, from the Blue Gene/Q: What counts as a stream is configurable at
runtime via a hardware register. It supports 3 settings:
* Interpret every memory access as the start of a stream
* Establish a stream when there are 2 consecutive cache misses
* Only establish streams via dcbt instructions.

> >> class TargetMemorySystemInfo {
> >>   const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
> >>
> >>   /// getNumLevels - Return the number of cache levels this target has.
> >>   ///
> >>   unsigned getNumLevels() const;
> >>
> >>   /// Cache level iterators
> >>   ///
> >>   cachelevel_iterator cachelevel_begin() const;
> >>   cachelevel_iterator cachelevel_end() const;
> >
> > May users of this class assume that a level refers to a specific
> > cache? E.g. getCacheLevel(0) being the L1 cache. Or do they have to
> > search for a cache of a specific size?
>
> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> the L2 cache and so on.

Can passes rely on it?

> >> //===--------------------------------------------------------------------===//
> >> // Stream Buffer Information
> >> //
> >> const TargetStreamBufferInfo *getStreamBufferInfo() const;
> >>
> >> //===--------------------------------------------------------------------===//
> >> // Software Prefetcher Information
> >> //
> >> const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
> >
> > Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
> > level? Some ISAs have multiple prefetchers/prefetch instructions
> > for different levels.
>
> Probably. Most X86 implementations direct all data prefetches to the
> same cache level so we didn't find a need to model this, but it makes
> sense to allow for it.

Again the Blue Gene/Q: streams prefetch into the L1P cache (P for
prefetch), but a dcbt instruction is necessary to establish the cache
line in the L1 cache.

> >> An open question is how to handle different SKUs within a subtarget
> >> family. We modeled the limited number of SKUs used in our products
> >> via multiple subtargets, so this wasn't a heavy burden for us, but a
> >> more robust implementation might allow for multiple ``MemorySystem``
> >> and/or ``ExecutionEngine`` models for a given subtarget. It's not yet
> >> clear whether that's a good/necessary thing and if it is, how to
> >> specify it with a compiler switch. ``-mcpu=shy-enigma
> >> -some-switch-to-specify-memory-and-execution-models``? It may very
> >> well be sufficient to have a general system model that applies
> >> relatively well over multiple SKUs.
> >
> > Adding more specific subtargets with more refined execution models
> > seems fine to me.
> > But is it reasonable to manage a database of all processors ever
> > produced in the compiler?
>
> No it is not. :) That's why this is an open question. We've found it
> perfectly adequate to define a single model for each major processor
> generation, but as I said we use a limited number of SKUs. We will
> need input from the community on this.

Independently of whether subtargets for SKUs are added, could we (also)
define these parameters via the command line, like xlc's -qcache option?

Michael
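P.S. To make the -qcache idea concrete, here is a rough sketch of what
such an override could look like on top of the proposed classes. The
option name and the getSizeInBytes() accessor are my invention for
illustration, not part of the RFC:

    // Rough sketch, not part of the RFC: allow overriding the modeled
    // cache parameters from the command line, analogous to xlc's -qcache.
    #include "llvm/Support/CommandLine.h"

    static llvm::cl::opt<unsigned> OverrideL1Size(
        "cache-l1-size", llvm::cl::init(0),
        llvm::cl::desc("Override the modeled L1 data cache size in bytes "
                       "(0 = use the subtarget's model)"));

    unsigned getEffectiveL1Size(const TargetMemorySystemInfo &MSI) {
      if (OverrideL1Size != 0)
        return OverrideL1Size;
      return MSI.getCacheLevel(0).getSizeInBytes(); // invented accessor
    }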
David Greene via llvm-dev
2018-Nov-01 21:55 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Michael Kruse via llvm-dev <llvm-dev at lists.llvm.org> writes:

> On Thu, Nov 1, 2018 at 15:21 David Greene <dag at cray.com> wrote:
>>
>> > thank you for sharing the system hierarchy model. IMHO it makes a lot
>> > of sense, although I don't know which of today's passes would make use
>> > of it. Here are my remarks.
>>
>> LoopDataPrefetch would use it via the existing TTI interfaces, but I
>> think that's about it for now. It's a bit of a chicken-and-egg, in that
>> passes won't use it if it's not there and there's no push to get it in
>> because few things use it. :)
>
> What kinds of passes use it in the Cray compiler?

Not sure how much I can say about that, unfortunately.

>> > I am wondering how one could model the following features using this
>> > model, or whether they should be part of a performance model at all:
>> >
>> > * ARM's big.LITTLE
>>
>> How is this modeled in the current AArch64 .td files? The current
>> design doesn't capture heterogeneity at all, not because we're not
>> interested but simply because our compiler captures that at a higher
>> level outside of LLVM.
>
> AFAIK it is not handled at all. Any architecture that supports
> big.LITTLE will return 0 on getCacheLineSize(). See
> AArch64Subtarget::initializeProperties().

Ok. I would like to start posting patches for review without
speculating too much on fancy/exotic things that may come later. We
shouldn't do anything that precludes extensions, but I don't want to get
bogged down in a lot of details on things related to a small number of
targets. Let's get the really common stuff in first. What do you think?

>> > * write-back / write-through write buffers
>>
>> Do you mean for caches, or something else?
>
> https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies
>
> Basically, with write-through, every store is a non-temporal store (or
> every temporal store behaves like a write-through, depending on how you
> view it).

A write-through store isn't the same thing as a non-temporal store, at
least in my understanding of the term from X86 and AArch64. A
non-temporal store bypasses the cache entirely. I'm struggling a bit to
understand how a compiler would make use of the cache's write-back
policy.

>> >> class TargetSoftwarePrefetcherInfo {
>> >>   /// Should we do software prefetching at all?
>> >>   ///
>> >>   bool isEnabled() const;
>> >
>> > isEnabled sounds like something configurable at runtime.
>>
>> Currently we use it to allow some subtargets to do software prefetching
>> and prevent it for others. I see how the name could be confusing
>> though. Maybe ShouldDoPrefetching?
>
> isPrefetchingProfitable()?

Sounds good.

> If it is a hardware property: isSupported()
> (i.e. the prefetch instruction would be a no-op on other implementations)

Oh, I hadn't even thought of that possibility.

>> > Is there a way to express on which level the number of streams is
>> > shared? For instance, a core might be able to track 16 streams, but
>> > if 4 threads are running (SMT), each can only use 4.
>>
>> I suppose we could couple the streaming information to an execution
>> resource, similar to what is done with cache levels to express this kind
>> of sharing. We haven't found a need for it but that doesn't mean it
>> wouldn't be useful for other/new targets.
>
> The example above is IBM's Blue Gene/Q processor, so yes, such targets
> do exist.

Ok.

>> > PowerPC's dcbt/dcbtst instruction allows explicitly specifying to the
>> > hardware which streams it should establish. Do the buffer counts
>> > include explicitly and automatically established streams? Do
>> > non-stream accesses (e.g. stack accesses) count towards the limit?
>>
>> It's up to the target maintainer to decide what the numbers mean.
>> Obviously passes have to have some notion of what things mean. The
>> thing that establishes what a "stream" is in the user program lives
>> outside of the system model. It may or may not consider random stack
>> accesses as part of a stream.
>>
>> This is definitely an area for exploration. Since we only have machines
>> with two major targets, we didn't need to contend with more exotic
>> things. :)
>
> IMHO it would be good if passes and targets agree on an interpretation
> of this number when designing the interface.

Of course.

> Again, from the Blue Gene/Q: What counts as a stream is configurable at
> runtime via a hardware register. It supports 3 settings:
> * Interpret every memory access as the start of a stream
> * Establish a stream when there are 2 consecutive cache misses
> * Only establish streams via dcbt instructions.

I think we're interpreting "streaming" differently. In this design, a
"stream" is a sequence of memory operations that should bypass the cache
because the data will never be reused (at least not in a timely manner).
On X86 processors the compiler has complete software control over
streaming through the use of movnt instructions. AArch64 had a similar,
though very restricted, capability until SVE.

dcbt is more like a prefetch than a movnt, right? It sounds like BG/Q
has a hardware prefetcher configurable by software. I think that would
fit better under a completely different resource type. The compiler's
use of dcbt would be guided by TargetSoftwarePrefetcherInfo, which could
be extended to represent BG/Q's configurable hardware prefetcher.

>> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
>> the L2 cache and so on.
>
> Can passes rely on it?

Yes.

>> Probably. Most X86 implementations direct all data prefetches to the
>> same cache level so we didn't find a need to model this, but it makes
>> sense to allow for it.
>
> Again the Blue Gene/Q: streams prefetch into the L1P cache (P for
> prefetch), but a dcbt instruction is necessary to establish the cache
> line in the L1 cache.

Yep, makes sense.

>> > Adding more specific subtargets with more refined execution models
>> > seems fine to me. But is it reasonable to manage a database of all
>> > processors ever produced in the compiler?
>>
>> No it is not. :) That's why this is an open question. We've found it
>> perfectly adequate to define a single model for each major processor
>> generation, but as I said we use a limited number of SKUs. We will
>> need input from the community on this.
>
> Independently of whether subtargets for SKUs are added, could we (also)
> define these parameters via the command line, like xlc's -qcache option?

I think that would be very useful.

-David
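P.S. For illustration, here is roughly how a pass could lean on that
ordering guarantee. getNumLevels() and getCacheLevel() are from the
proposal; the getSizeInBytes() accessor is made up for the example:

    // Sketch: find the smallest cache level that holds a given working
    // set, relying on getCacheLevel(0) == L1, getCacheLevel(1) == L2, ...
    #include <cstdint>

    unsigned findSmallestLevelHolding(const TargetMemorySystemInfo &MSI,
                                      uint64_t WorkingSetBytes) {
      for (unsigned L = 0, E = MSI.getNumLevels(); L != E; ++L)
        if (MSI.getCacheLevel(L).getSizeInBytes() >= WorkingSetBytes)
          return L; // 0 = L1, 1 = L2, ...
      return ~0u;   // does not fit in any modeled cache level
    }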
Renato Golin via llvm-dev
2018-Nov-02 20:06 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Hey,

I've been reading back the thread and there are a lot of ideas flying
around. I may have missed more than I should, but here's my view on it.

First, I think this is a good idea. Mapping caches is certainly
interesting for general architectures, but particularly important for
massive operations like matrix multiply and stencils, which can pull a
lot of data into cache and sometimes thrash it if not careful. With
scalable and larger vectors, this will be even more important.

Overall, I think this is a good idea, but the current proposal is too
detailed on the implementation and not detailed enough on the use for me
to have a good idea of how and where this will be used. Can you describe
a few situations where these new interfaces would be used, and how?

Some comments inline.

On Thu, 1 Nov 2018 at 21:56, David Greene via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Ok. I would like to start posting patches for review without
> speculating too much on fancy/exotic things that may come later. We
> shouldn't do anything that precludes extensions but I don't want to get
> bogged down in a lot of details on things related to a small number of
> targets. Let's get the really common stuff in first. What do you
> think?

In theory, both big and little cores should have the same cache
structure, so we don't necessarily need extra descriptions for both. In
practice, sub-architectures can have multiple combinations of big.LITTLE
cores and it's simply not practical to add all of that to TableGen.

--
cheers,
--renato
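P.S. An example of the kind of use I'm asking about: a blocked matrix
multiply could size its tiles from the modeled cache. A rough sketch,
assuming the model can be asked for the L1 size in bytes:

    // Sketch: pick a square tile size T so that three T x T tiles of
    // doubles (blocks of A, B and C) fit in the modeled L1 data cache,
    // i.e. 3 * T * T * sizeof(double) <= L1SizeBytes.
    #include <cmath>
    #include <cstdint>

    unsigned pickTileSize(uint64_t L1SizeBytes) {
      uint64_t MaxElems = L1SizeBytes / (3 * sizeof(double));
      unsigned T = static_cast<unsigned>(
          std::sqrt(static_cast<double>(MaxElems)));
      return T ? T : 1; // degenerate caches still get a 1x1 tile
    }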
Michael Kruse via llvm-dev
2018-Nov-02 21:16 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Thu, Nov 1, 2018 at 16:56 David Greene <dag at cray.com> wrote:
> Ok. I would like to start posting patches for review without
> speculating too much on fancy/exotic things that may come later. We
> shouldn't do anything that precludes extensions but I don't want to get
> bogged down in a lot of details on things related to a small number of
> targets. Let's get the really common stuff in first. What do you
> think?

I agree.

> > Again, from the Blue Gene/Q: What counts as a stream is configurable
> > at runtime via a hardware register. It supports 3 settings:
> > * Interpret every memory access as the start of a stream
> > * Establish a stream when there are 2 consecutive cache misses
> > * Only establish streams via dcbt instructions.
>
> I think we're interpreting "streaming" differently. In this design, a
> "stream" is a sequence of memory operations that should bypass the cache
> because the data will never be reused (at least not in a timely manner).

I understood "stream" as "prefetch stream", i.e. something that
prefetches the data for an access A[i] in a for-loop. I'd call
"bypassing the cache because the data will never be reused" a
non-temporal memory access.

Under the latter interpretation, what does "number of streams" mean?
AFAIU the processor will just queue memory operations (e.g. for writing
to RAM). Is it the maximum number of operations in the queue?

> >> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> >> the L2 cache and so on.
> >
> > Can passes rely on it?
>
> Yes.

Naively, I'd put Blue Gene/Q's L1P cache between the L1 and the L2,
i.e. the L1P would be getCacheLevel(1) and getCacheLevel(2) would be
the L2. How would you model it instead?

Michael
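P.S. For concreteness, the naive encoding I have in mind would look
something like this. The struct is purely illustrative (the RFC's
TargetCacheLevelInfo would play this role), and the sizes are the Blue
Gene/Q numbers as I remember them:

    // Purely illustrative: Blue Gene/Q's hierarchy with the L1P prefetch
    // buffer as its own numbered level between L1 and L2. Whether a
    // prefetch-only buffer should occupy a level is exactly the question.
    #include <cstdint>

    struct CacheLevelDesc {
      const char *Name;
      uint64_t SizeInBytes;
      bool PrefetchOnly; // lines are staged here rather than demand-cached
    };

    static const CacheLevelDesc BGQLevels[] = {
        {"L1", 16 * 1024, false},        // getCacheLevel(0): per-core L1D
        {"L1P", 4 * 1024, true},         // getCacheLevel(1): prefetch buffer
        {"L2", 32 * 1024 * 1024, false}, // getCacheLevel(2): shared L2
    };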