Vikram TV via llvm-dev
2016-Jun-23 17:28 UTC
[llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis
On Thu, Jun 23, 2016 at 9:54 AM, Adam Nemet <anemet at apple.com> wrote:
> On Jun 9, 2016, at 9:21 AM, Vikram TV via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On Wed, Jun 8, 2016 at 10:20 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>> From: "Vikram TV via llvm-dev" <llvm-dev at lists.llvm.org>
>> To: "DEV" <llvm-dev at lists.llvm.org>
>> Sent: Wednesday, June 8, 2016 2:58:17 AM
>> Subject: [llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis
>>
>> Hi,
>>
>> This is a proposal to implement an analysis that calculates loop cost based on cache data. The primary motivation for implementing this is to write profitability measures for cache-related optimizations like interchange, fusion, fission, prefetching and others.
>>
>> I have implemented a prototype at http://reviews.llvm.org/D21124.
>>
>> A quick comment on your pass: you'll also want to add the ability to get the loop trip count from profiling information via BFI when available (by dividing the loop header frequency by the loop predecessor-block frequencies).
>>
> Sure. I will add it.
>
>> The patch basically creates groups of references that would lie in the same cache line. Each group is then analysed with respect to the innermost loops, considering cache lines. The penalty for a reference is:
>> a. 1, if the reference is invariant with respect to the innermost loop,
>> b. TripCount, for a non-unit-stride access,
>> c. TripCount / CacheLineSize, for a unit-stride access.
>> The loop cost is then calculated as the sum of the reference penalties times the product of the loop bounds of the outer loops. This loop cost can then be used as a profitability measure for cache-reuse-related optimizations. This is just a brief description; please refer to [1] for more details.
>>
>> This sounds like the right approach, and is certainly something we need.
>> One question is how the various transformations you name above (loop fusion, interchange, etc.) will use this analysis to evaluate the effect of the transformation they are considering performing. Specifically, they likely need to do this without actually rewriting the IR (speculatively).
>>
> For Interchange: the loop cost can be used to build a desired ordering, and interchange can try to match the nearest possible ordering. A few internal kernels and matmul work fine with this approach.
> For Fission/Fusion: the loop cost can provide the cost before and after fusion. The loop cost for a fused loop can be obtained by building the reference groups as if the loop were fused. The fission case is similar, but with partitions.
> These require no prior IR changes.

> Hi Vikram,
>
> Is the analysis result specific to a loop nest, or to a loop nest together with a set of reference groups?

The result is specific to each loop in the loop nest, and the calculations are based on the references in the loop nest.

> Adam

>> A primary drawback of the above patch is the static use of the cache line size. I wish to get this data from tblgen/TTI, and I am happy to submit patches for it.
>>
>> Yes, this sounds like the right direction. The targets obviously need to provide this information.
>>
>> Finally, if the community is interested in the above work, I have the following work plan:
>> a. Update the loop cost patch to a consistent, committable state.
>> b. Add cache data information to tblgen/TTI.
>> c. Rewrite Loop Interchange profitability to use Loop Cost.
>>
>> Great!
>>
>> This does not affect loop interchange directly, but one thing we also need to consider is the number of prefetching streams supported by the hardware prefetcher on various platforms. This is generally, per thread, some number around 5 to 10.
>> Splitting loops that require more streams than supported, and preventing fusion from requiring more streams than supported, is also an important cost metric. Of course, so are ILP and other factors.
>>
> Cache-based cost computation is one such metric, and yes, more metrics need to be added.
>
>> -Hal
>>
>> Please share your valuable thoughts.
>>
>> Thank you.
>>
>> References:
>> [1] [Carr-McKinley-Tseng] http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf
>>
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory

--
Good time...
Vikram TV
CompilerTree Technologies
Mysore, Karnataka, INDIA
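The penalty rules described in the proposal above (1 for an invariant reference, TripCount for a non-unit stride, TripCount / CacheLineSize for a unit stride, all scaled by the bounds of the other loops) can be sketched in a few lines. This is a minimal illustration only, not the D21124 implementation; the function names, the stride encoding, and the 64-element cache line are assumptions:

```python
from functools import reduce

CACHE_LINE = 64  # elements per line; in the real analysis this should come from TTI


def ref_penalty(stride, trip_count):
    """Penalty of one reference group w.r.t. a candidate innermost loop."""
    if stride == 0:                  # invariant in that loop: one line fetch
        return 1.0
    if abs(stride) == 1:             # unit stride: one miss per cache line
        return trip_count / CACHE_LINE
    return float(trip_count)         # non-unit stride: miss every iteration


def loop_cost(candidate, loops, refs):
    """loops: {name: trip_count}; refs: list of {loop_name: stride} maps."""
    penalties = sum(ref_penalty(r.get(candidate, 0), loops[candidate])
                    for r in refs)
    # Scale by the product of the bounds of the remaining (outer) loops.
    outer = reduce(lambda a, b: a * b,
                   (tc for name, tc in loops.items() if name != candidate), 1)
    return penalties * outer


# A[i][j] is unit stride in j; B[j][i] is unit stride in i, non-unit in j.
loops = {'i': 100, 'j': 100}
refs = [{'j': 1}, {'i': 1, 'j': 100}]
best = min(loops, key=lambda l: loop_cost(l, loops, refs))
print(best)  # 'i': cheaper as the innermost loop for this access pattern
```

The loop with the lowest cost is the best candidate for the innermost position, which is exactly how the interchange use case below consumes the analysis.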
Adam Nemet via llvm-dev
2016-Jun-23 18:04 UTC
[llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis
> On Jun 23, 2016, at 10:28 AM, Vikram TV <vikram.tarikere at gmail.com> wrote:
>
> [...]
>
>> Is the analysis result specific to a loop nest, or to a loop nest together with a set of reference groups?
>
> The result is specific to each loop in the loop nest, and the calculations are based on the references in the loop nest.

Sorry, can you please rephrase/elaborate? I don't understand what you mean. Analysis results are retained unless a transformation invalidates them. My question is how the reference groups affect all this.

You could probably describe how you envision this analysis being used with something like Loop Fusion.

Thanks,
Adam
Vikram TV via llvm-dev
2016-Jun-23 19:06 UTC
[llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis
On Thu, Jun 23, 2016 at 11:34 PM, Adam Nemet <anemet at apple.com> wrote:
>>> Hi Vikram,
>>>
>>> Is the analysis result specific to a loop nest, or to a loop nest together with a set of reference groups?
>>
>> The result is specific to each loop in the loop nest, and the calculations are based on the references in the loop nest.
>
> Sorry, can you please rephrase/elaborate? I don't understand what you mean. Analysis results are retained unless a transformation invalidates them. My question is how the reference groups affect all this.

Sorry, I now understand that you meant LLVM's analysis result (I had read it as asking how the analysis calculates the result). The analysis result is specific to each loop in its loop nest. Reference groups are needed only during cost computation.

> You could probably describe how you envision this analysis being used with something like Loop Fusion.

For Loop Interchange, the cost calculated for each loop provides a desired ordering of the loops. Loop Interchange can then interchange the loops to match the desired order iff the interchange legality check passes.

E.g., for the matrix multiplication case:

  for (i = 0 to 5000)
    for (j = 0 to 5000)
      for (k = 0 to 5000)
        C[i,j] = C[i,j] + A[i,k] * B[k,j]

Here, based on the cache line size and the cache penalties for the references (please refer to the first mail of the thread or the reference paper for the different penalties), the following costs result:

  Loop i cost: 2.501750e+11
  Loop j cost: 6.256252e+10
  Loop k cost: 1.563688e+11

The loop cost here is nothing but the total cache penalty due to all references, including penalties due to the outer loops. A lower-cost loop is therefore a better fit for the innermost position, because it has better cache locality. This yields the desired loop order i, k, j; LoopInterchange can interchange the 'j' and 'k' loops provided the legality check passes (the nearest ordering for which legality holds is also an optimal loop ordering).
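For what it's worth, the three costs quoted above can be reproduced exactly from the penalty rules. The constants below are my inference from the numbers rather than something read out of the patch: "0 to 5000" is taken as inclusive (TripCount = 5001), and a cache line is assumed to hold 4 array elements:

```python
# Back-of-the-envelope reconstruction of the matmul loop costs above.
# Assumptions (inferred, not from the patch): TripCount = 5001, cache
# line = 4 elements, row-major layout.
N, LINE = 5001, 4


def penalty(kind):
    if kind == 'invariant':
        return 1.0
    if kind == 'unit':
        return N / LINE          # one miss per cache line
    return float(N)              # non-unit stride: miss every iteration


# Stride class of C[i,j], A[i,k], B[k,j] w.r.t. each loop:
refs = {
    'i': ['nonunit', 'nonunit', 'invariant'],  # C, A non-unit in i; B invariant
    'j': ['unit', 'invariant', 'unit'],        # C, B unit in j; A invariant
    'k': ['invariant', 'unit', 'nonunit'],     # A unit in k; B non-unit
}

for loop in 'ijk':
    # Penalty sum times the product of the other two loop bounds.
    cost = sum(penalty(s) for s in refs[loop]) * N * N
    print('Loop %s cost: %e' % (loop, cost))
```

Under those two assumptions this prints the same three values as the analysis, which also makes the ordering argument easy to check: loop j has the lowest cost, so it is the best innermost loop.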
For fusion/fission, the cache-based loop costs can be calculated by assigning cache penalties to the references with respect to the loops before and after fusion/fission. The costs then serve as a profitability measure for fusion/fission.

I think the use-case description was very brief in a previous mail, so I have elaborated this time. More details can also be found in http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf.

Thanks

Good time...
Vikram TV
CompilerTree Technologies
Mysore, Karnataka, INDIA
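A toy illustration of the fusion case (my own sketch, not taken from the patch): references from both loop bodies that touch the same cache lines fall into a single reference group after fusion, so the fused loop can cost less than the two original loops combined:

```python
# Toy fusion-profitability check using the unit-stride penalty only.
# Loop 1 body touches A[i] and B[i]; loop 2 body touches A[i] and C[i].
N, LINE = 10000, 64  # hypothetical trip count and cache line size


def unit_penalty():
    return N / LINE  # unit-stride reference group: one miss per line


cost_loop1 = 2 * unit_penalty()   # groups {A[i]}, {B[i]}
cost_loop2 = 2 * unit_penalty()   # groups {A[i]}, {C[i]}
cost_before = cost_loop1 + cost_loop2

# Fused body touches A[i], B[i], C[i]; the two A[i] accesses now share
# one reference group, leaving three groups instead of four.
cost_after = 3 * unit_penalty()

print(cost_before, cost_after)  # fusion is profitable when cost_after < cost_before
```

In this model fusion is profitable exactly when the bodies share reference groups; if the two loops touched disjoint arrays, the before and after costs would be equal and fusion would be cache-neutral.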