Michael Kruse via llvm-dev
2018-Nov-02 21:16 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Thu, Nov 1, 2018 at 16:56, David Greene <dag at cray.com> wrote:
> Ok.  I would like to start posting patches for review without
> speculating too much on fancy/exotic things that may come later.  We
> shouldn't do anything that precludes extensions but I don't want to get
> bogged down in a lot of details on things related to a small number of
> targets.  Let's get the really common stuff in first.  What do you
> think?

I agree.

> > Again, from the Blue Gene/Q: What counts as stream is configurable at
> > runtime via a hardware register. It supports 3 settings:
> > * Interpret every memory access as start of a stream
> > * Interpret a stream when there are 2 consecutive cache misses
> > * Only establish streams via dcbt instructions.
>
> I think we're interpreting "streaming" differently.  In this design, a
> "stream" is a sequence of memory operations that should bypass the cache
> because the data will never be reused (at least not in a timely manner).

I understood "stream" as "prefetch stream", something that prefetches
the data for an access A[i] in a for-loop.

I'd call "bypassing the cache because the data will never be reused" a
non-temporal memory access.

In the latter interpretation, what does "number of streams" mean? AFAIU
the processor will just queue memory operations (e.g. for writing to
RAM). Is it the maximum number of operations in the queue?

> >> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> >> the L2 cache and so on.
> >
> > Can passes rely on it?
>
> Yes.

Naively, I'd put Blue Gene/Q's L1P cache between the L1 and the L2,
i.e. the L1P would be getCacheLevel(1) and getCacheLevel(2) would be
L2. How would you model it instead?

Michael
David Greene via llvm-dev
2018-Nov-05 16:25 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
Michael Kruse <llvmdev at meinersbur.de> writes:

>> > Again, from the Blue Gene/Q: What counts as stream is configurable at
>> > runtime via a hardware register. It supports 3 settings:
>> > * Interpret every memory access as start of a stream
>> > * Interpret a stream when there are 2 consecutive cache misses
>> > * Only establish streams via dcbt instructions.
>>
>> I think we're interpreting "streaming" differently.  In this design, a
>> "stream" is a sequence of memory operations that should bypass the cache
>> because the data will never be reused (at least not in a timely manner).
>
> I understood "stream" as "prefetch stream", something that prefetches
> the data for an access A[i] in a for-loop.
>
> I'd call "bypassing the cache because the data will never be reused" a
> non-temporal memory access.

Yes, I agree the terminology is confusing.  I used the term "stream" in
the sense of stream processing
(https://en.wikipedia.org/wiki/Stream_processing).  The programming
model is very different, of course, but the idea of a stream of data
that is acted upon and then essentially discarded is similar.

> In the latter interpretation, what does "number of streams" mean?
> AFAIU the processor will just queue memory operations (e.g. for
> writing to RAM). Is it the maximum number of operations in the queue?

On X86, NT writes go to so-called "write-combined memory."
Hardware-wise, that translates to some number of buffers where stores
are collected and merged, resulting in a (hopefully) single cache
line-sized write to memory that aggregates some number of individual
stores.  The number of buffers limits the number of independent sets of
stores that can be active.  For example, if the hardware has four such
buffers, I can do this and be fine:

for (...)
  A[i] = ...
  B[i] = ...
  C[i] = ...
  D[i] = ...

The sequence of (non-temporal) stores to each array will map to separate
hardware buffers and be write-combined in the way one would expect.  But
this causes a problem:

for (...)
  A[i] = ...
  B[i] = ...
  C[i] = ...
  D[i] = ...
  E[i] = ...

If all of the stores are non-temporal, then at least two of the array
store sequences will interfere with each other in the write-combining
buffers and will force early flushes of the buffer, effectively turning
them into single-store writes to memory.  That's bad.

Maybe the proper name for this concept is simply "WriteCombiningBuffer."
I'm not sure whether some other architecture might have a concept of store
buffers that does something other than write-combining, so I was trying
to use a fairly generic name to mean "some compiler-controlled hardware
buffer."

There's a similar concept on the load side, though I don't know whether
any existing processors actually implement things that way.  I know of
(academic) architectures where prefetches fill independent prefetch
buffers and one wouldn't want to prefetch too many different things
because they would start filling each other's buffers.  That kind of
behavior could be captured by this model.

The key factor is contention.  There's a limited hardware memory buffer
resource and compilers have to be careful not to oversubscribe it.  I
don't know what the right name for it is.  Probably there will be more
than one such type of resource on some architectures, so for now maybe
we just model write-combining buffers and leave it at that.  If other
such resources pop up, we can model them with different names.

I think that's what we should do.
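To make the contention point concrete, the kind of code I'm talking
about looks roughly like this (just a sketch; _mm_stream_si32 is the
SSE2 non-temporal store intrinsic, and the four-buffer count is the
hypothetical one from above):

#include <immintrin.h>   // _mm_stream_si32, _mm_sfence

// Four independent non-temporal store streams.  With four
// write-combining buffers, each array gets its own buffer and the
// stores are merged into full cache-line writes to memory.
void fill4(int *A, int *B, int *C, int *D, int n) {
  for (int i = 0; i < n; ++i) {
    _mm_stream_si32(&A[i], i);   // NT store: bypasses the cache and
    _mm_stream_si32(&B[i], i);   // collects in a WC buffer instead
    _mm_stream_si32(&C[i], i);
    _mm_stream_si32(&D[i], i);
  }
  _mm_sfence();                  // flush/order the combined writes
}

Add a fifth streamed array to the loop body and, with only four
buffers, two of the store sequences start evicting each other's
partially filled buffers, which is the early-flush problem described
above.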
>> >> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
>> >> the L2 cache and so on.
>> >
>> > Can passes rely on it?
>>
>> Yes.
>
> Naively, I'd put Blue Gene/Q's L1P cache between the L1 and the L2,
> i.e. the L1P would be getCacheLevel(1) and getCacheLevel(2) would be
> L2. How would you model it instead?

That seems ok to me.  As I understand it, L1P is a little awkward in
that L2 data doesn't get moved to L1P, it gets moved to L1.  L1P is
really a prefetch buffer, right?  One wouldn't do, say, cache blocking
for L1P.  In that sense maybe modeling it as a cache level isn't the
right thing.

How does software make use of L1P?  I understand compilers can insert
data prefetches and the data resides in L1P, presumably until it is
accessed and then it moves to L1.  I suppose the size of L1P could
determine how aggressively compilers prefetch.  Is that the idea or are
you thinking of something else?

                      -David
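P.S.  To make the cache-blocking remark concrete, this is the sort of
decision I expect passes to drive from the level-0 (L1) parameters.
The sketch below simply takes the cache size as an argument, since the
exact query names are still under review:

#include <cmath>   // std::sqrt

// Tile a transposed copy so each tile's working set (one tile of src
// plus one tile of dst) fits in the L1 data cache.  l1_bytes would
// come from the level-0 entry of the cache model.
void tiled_transpose(double *dst, const double *src, int n,
                     unsigned l1_bytes) {
  int tile = (int)std::sqrt((double)l1_bytes / (2 * sizeof(double)));
  if (tile < 1) tile = 1;
  for (int ii = 0; ii < n; ii += tile)
    for (int jj = 0; jj < n; jj += tile)
      for (int i = ii; i < ii + tile && i < n; ++i)
        for (int j = jj; j < jj + tile && j < n; ++j)
          dst[j * n + i] = src[i * n + j];
}

Blocking like this only makes sense for a level that actually holds
reused data, which is why a prefetch buffer like L1P doesn't fit the
same mold.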
Michael Kruse via llvm-dev
2018-Nov-07 23:26 UTC
[llvm-dev] RFC: System (cache, etc.) model for LLVM
On Mon, Nov 5, 2018 at 10:26, David Greene <dag at cray.com> wrote:
> Yes, I agree the terminology is confusing.  I used the term "stream" in
> the sense of stream processing
> (https://en.wikipedia.org/wiki/Stream_processing).  The programming
> model is very different, of course, but the idea of a stream of data
> that is acted upon and then essentially discarded is similar.
>
> > In the latter interpretation, what does "number of streams" mean?
> > AFAIU the processor will just queue memory operations (e.g. for
> > writing to RAM). Is it the maximum number of operations in the queue?
>
> On X86, NT writes go to so-called "write-combined memory."
> Hardware-wise, that translates to some number of buffers where stores
> are collected and merged, resulting in a (hopefully) single cache
> line-sized write to memory that aggregates some number of individual
> stores.  The number of buffers limits the number of independent sets of
> stores that can be active.  For example, if the hardware has four such
> buffers, I can do this and be fine:
>
> for (...)
>   A[i] = ...
>   B[i] = ...
>   C[i] = ...
>   D[i] = ...
>
> The sequence of (non-temporal) stores to each array will map to separate
> hardware buffers and be write-combined in the way one would expect.  But
> this causes a problem:
>
> for (...)
>   A[i] = ...
>   B[i] = ...
>   C[i] = ...
>   D[i] = ...
>   E[i] = ...
>
> If all of the stores are non-temporal, then at least two of the array
> store sequences will interfere with each other in the write-combining
> buffers and will force early flushes of the buffer, effectively turning
> them into single-store writes to memory.  That's bad.
>
> Maybe the proper name for this concept is simply "WriteCombiningBuffer."
> I'm not sure whether some other architecture might have a concept of store
> buffers that does something other than write-combining, so I was trying
> to use a fairly generic name to mean "some compiler-controlled hardware
> buffer."
>
> There's a similar concept on the load side, though I don't know whether
> any existing processors actually implement things that way.  I know of
> (academic) architectures where prefetches fill independent prefetch
> buffers and one wouldn't want to prefetch too many different things
> because they would start filling each other's buffers.  That kind of
> behavior could be captured by this model.
>
> The key factor is contention.  There's a limited hardware memory buffer
> resource and compilers have to be careful not to oversubscribe it.  I
> don't know what the right name for it is.  Probably there will be more
> than one such type of resource on some architectures, so for now maybe
> we just model write-combining buffers and leave it at that.  If other
> such resources pop up, we can model them with different names.
>
> I think that's what we should do.

Thank you for the detailed explanation.

We could use the notion of a "sustainable stream", i.e. the maximum
number of (consecutive?) read/write streams that a processor can
support before a disproportionate loss in performance happens. This is
oblivious to the reason why that performance loss happens, be it
write-combining buffers or prefetch streams. If there are multiple such
bottlenecks, it would be the minimum number of such streams. At the
moment I cannot think of an optimization where the difference matters
(which doesn't mean there isn't a case where it does).
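For illustration, what I have in mind is something like the sketch
below; the struct and function names are made up for this mail, not a
proposal for the actual interface:

#include <algorithm>   // std::min

// Hypothetical per-target limits; both map to the "contention" idea
// from your mail, just with different hardware causes.
struct StreamLimits {
  unsigned WriteCombiningBuffers;   // e.g. the 4 buffers in the X86 example
  unsigned HWPrefetchStreams;       // e.g. prefetch engines on Blue Gene/Q
};

// Number of concurrent read/write streams a loop can use before we
// expect a disproportionate slowdown: the tightest bottleneck wins.
unsigned sustainableStreams(const StreamLimits &L) {
  return std::min(L.WriteCombiningBuffers, L.HWPrefetchStreams);
}

A transformation such as loop distribution could then split a loop
whose number of simultaneously accessed arrays exceeds this value,
without having to know which of the two limits it would have hit.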
> That seems ok to me.  As I understand it, L1P is a little awkward in
> that L2 data doesn't get moved to L1P, it gets moved to L1.  L1P is
> really a prefetch buffer, right?  One wouldn't do, say, cache blocking
> for L1P.  In that sense maybe modeling it as a cache level isn't the
> right thing.

The L1P (4 KiB) is smaller than the L1 cache (16 KiB), so blocking for
it indeed makes no sense. But when optimizing for the Blue Gene/Q, I
could not just ignore it. However, maybe we should leave it out of our
API considerations: the Blue Gene/Q is being phased out, and I know of
no other architecture with such a cache hierarchy.

> How does software make use of L1P?  I understand compilers can insert
> data prefetches and the data resides in L1P, presumably until it is
> accessed and then it moves to L1.  I suppose the size of L1P could
> determine how aggressively compilers prefetch.  Is that the idea or are
> you thinking of something else?

I declared streams for the CPU to prefetch (which "run" over memory at
different speeds), so that at some point in time I can assume the data
to be in the L1P cache. Using the dcbt instruction, a cache line can be
lifted from the L1P into the L1 cache a fixed number of cycles in
advance. If the cache line had to be prefetched from the L2 instead,
the prefetch/access latency would be longer (24 cycles vs. 82 cycles).

Michael
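P.S.  As a rough illustration of the pattern I mean, using the generic
__builtin_prefetch (which lowers to dcbt on PowerPC) and a made-up
prefetch distance:

// Prefetch a fixed distance ahead of the use so the line has already
// been lifted into the L1 by the time it is accessed.  The distance of
// 8 iterations is only a placeholder; it would be derived from the
// prefetch latency (e.g. the 24 cycles mentioned above).
double sum_with_prefetch(const double *A, long n) {
  const long Dist = 8;
  double sum = 0.0;
  for (long i = 0; i < n; ++i) {
    if (i + Dist < n)
      __builtin_prefetch(&A[i + Dist], /*rw=*/0, /*locality=*/3);
    sum += A[i];
  }
  return sum;
}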