There is work taking place by multiple people in this area and more is expected to happen, and I’d like to make sure we’re working toward a common end goal.

I tried to collect the use cases for run-time memory checks and the specific memchecks required for each:

1. Loop Vectorizer: each memory access is checked against all other memory accesses in the loop (except read vs. read)
2. Loop Distribution: only memory accesses in different partitions are checked against each other. The loop vectorizer will add its own checks for the vectorizable distributed loops
3. Loop Fusion: accesses from the to-be-merged loops should be checked against each other
4. LICM: if hoisting a load, stores need to be checked. When sinking a store, all accesses are checked
5. Load elimination in GVN: all *intervening* stores need to be checked
6. Instruction scheduling for in-order targets: same as 5

Currently only the first two are implemented. Ashutosh has a pending review for LICM/independent pass. I am also working on loop-aware load elimination requiring memchecks (5 from the above list).
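[Editor's note: to make the memcheck idea concrete, here is a minimal hand-written sketch of what the versioning in case 1 (the loop vectorizer) amounts to at the source level. The function and variable names are illustrative only; the real passes emit the equivalent checks and loop copies at the IR level rather than in C++.]

#include <cstddef>

// Source loop: 'a' and 'b' may alias, so without a run-time check the
// accesses cannot be treated as independent.
void saxpy(float *a, const float *b, float k, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    a[i] += k * b[i];
}

// Hand-written equivalent of the versioned loop: a single bound-overlap
// memcheck guards a fast path on which the accesses are known independent.
void saxpy_versioned(float *a, const float *b, float k, std::size_t n) {
  bool no_overlap = a + n <= b || b + n <= a;
  if (no_overlap) {
    for (std::size_t i = 0; i < n; ++i)   // fast path: safe to vectorize
      a[i] += k * b[i];
  } else {
    for (std::size_t i = 0; i < n; ++i)   // slow path: original may-alias loop
      a[i] += k * b[i];
  }
}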
The two main approaches are whether to do this in a separate pass or to do it locally in the passes that benefit from versioning. I tried to collect the pros and cons of each.

1. Separate pass

The benefit of this approach is that current passes would not have to be modified to take advantage of the new freedom due to the increased independence of the memory access operations. AA will capture the noalias annotations inserted by this pass and present them to the passes.

Memchecks present an overhead at run time, so one question is how we ensure that any optimization will amortize the cost of these checks.

Depending on the optimization, answering this could be pretty involved (i.e. almost like running the pass itself). Consider the loop vectorizer: in order to answer whether versioning the loop would make it vectorizable, you’d have to run most of the legality and profitability logic from the pass.

Also, which accesses do we need to check? Depending on the optimization we may not need to check each access against each other access, which, being quadratic, can be a significant difference.

2. Each pass performs its own versioning

Under this approach, each pass would make the calculation locally whether the benefit of versioning outweighs the overhead of the checks. The current Loop Vectorizer is a good example of this. It effectively assumes no may-alias, and if the number of required checks is under a certain threshold, it assumes that the vectorization gain will outweigh the cost of the checks.

Making the decision locally is not ideal in this approach, i.e. if we can amortize the cost of the same checks with a combination of optimizations from *multiple* passes, neither pass would make the decision locally to version.

Also, it’s probably beneficial to perform a single loop versioning even if multiple passes would like to check different accesses. E.g. rather than:

        Checks_1
        /      \
       /        \
  OrigLoop    Checks_2
      \        /    \
       \      /      \
        \ NoAlias_1  NoAlias_2
         \    |      /
          \   |     /
           \  |    /
             Join

But instead:

    Checks_1+Checks_2
        /      \
       /        \
  OrigLoop    NoAlias_2
       \        /
        \      /
         \    /
          Join

This is effectively creating a fast-path and a slow-path version of the loop. We would probably need some metadata annotation so that subsequent passes could amend the same checking block.
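[Editor's note: a hedged source-level sketch of the single-versioning shape in the second diagram, with made-up names and hand-written C++ rather than actual pass output. The checks needed by two different transformations are merged into one guard, so there is a single specialized loop plus the original as the fallback, instead of two nested versioning diamonds.]

#include <cstddef>

// Checks_1: pass 1 (say, load elimination) needs 'a' and 'b' disjoint.
// Checks_2: pass 2 (say, the vectorizer) additionally needs 'a' and 'c'
// disjoint. Versioning once with the union of the checks keeps a single
// fast path on which both passes' noalias facts hold.
void combined_versioning(float *a, const float *b, const float *c,
                         std::size_t n) {
  bool checks_1 = a + n <= b || b + n <= a;
  bool checks_2 = a + n <= c || c + n <= a;
  if (checks_1 && checks_2) {
    for (std::size_t i = 0; i < n; ++i)   // fast path: both sets of checks passed
      a[i] = b[i] + c[i];
  } else {
    for (std::size_t i = 0; i < n; ++i)   // slow path: original loop, no assumptions
      a[i] = b[i] + c[i];
  }
}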
3. There are some more futuristic ideas, like always versioning fully, disambiguating all accesses, and then having the optimizers transform both versions of the loop. Then a later pass would decide which checks were necessary and what additional optimizations were done in the speculative version of the loop. Then finally make the cost decision of whether to keep the speculative version along with the checks or to remove them. This would probably need a fairly sophisticated set of metadata.

I think that a combination of 1 and 2 makes sense. This is where we will effectively end up after Ashutosh’s patch. This would hopefully give us the best of both worlds: the aggressiveness/precision of making the call locally in complex passes and the option of simplicity/orthogonality of the separate pass.

Thoughts?

Adam

----- Original Message -----
> From: "Adam Nemet" <anemet at apple.com>
> To: "Dev" <llvmdev at cs.uiuc.edu>, "Ashutosh Nema" <Ashutosh.Nema at amd.com>, "Hal Finkel" <hfinkel at anl.gov>
> Sent: Thursday, May 21, 2015 11:53:08 AM
> Subject: Alias-based Loop Versioning
>
> [...]
>
> This is effectively creating a fast-path and a slow-path version of
> the loop. We would probably need some metadata annotation so that
> subsequent passes could amend the same checking block.

I think this is where we'd like to end up. A loop invocation is likely either "nice" or not nice, and dealing with them on that basis is probably the best performance vs. code size vs. compile time trade-off. The only exception to that I feel might be likely is for loop splitting, and possibly for alignment checks (when we have code that accounts for that during vectorization). That having been said, I'm not basing these statements on data, but rather on my experience that programmers tend to write loops intended to be hot without the kinds of dependencies for which we need to check.

We'll also need to think about how this can be done in general. Different transformations will benefit from different checks, and collecting the logic for this in one central location is probably not a good idea. A better idea, perhaps, is to have each of these transformation passes split off their profitability analysis into an actual analysis pass that can be used by the check-insertion pass in order to rank and select the checks desirable for each loop.
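[Editor's note: one hypothetical shape for the interface described above, purely as a sketch. Every name below is invented for illustration and is not an existing LLVM API: each transformation would expose its versioning profitability as an analysis that a separate check-insertion pass could query and rank per loop.]

#include <vector>

namespace sketch {

struct MemCheck {
  // A pair of pointer ranges that must be proven disjoint at run time.
  const void *PtrA;
  const void *PtrB;
};

struct VersioningCandidate {
  std::vector<MemCheck> RequiredChecks; // checks this transform needs to pass
  unsigned ExpectedBenefit;             // e.g. estimated cycles saved per iteration
};

// Implemented by each transformation (vectorizer, distribution, LICM, ...):
// report what the transform could do for a given loop if the listed checks
// were known to hold, so the check-insertion pass can select and merge checks.
class VersioningProfitability {
public:
  virtual ~VersioningProfitability() = default;
  virtual std::vector<VersioningCandidate> candidatesForLoop(/*Loop &L*/) = 0;
};

} // namespace sketch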
-Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
I am not clear what check_1+check_2 really means. For example, if check_2 is a subset of check_1 and check_1+check_2 implies union, then the new check_1+check_2 may be overly conservative, resulting in lost opportunities.
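[Editor's note: a small hedged illustration of the concern, with made-up pointers and names. Suppose check_1 requires both A/B and A/C to be disjoint while check_2 only requires A/B; a single fast path guarded by the union falls back to the original loop whenever any merged check fails, even when check_2 alone would have passed.]

#include <cstdio>

// "No overlap" test for two ranges of n elements starting at a and b.
static bool noOverlap(const int *a, const int *b, int n) {
  return a + n <= b || b + n <= a;
}

int main() {
  int buf[40];
  const int n = 10;
  int *A = buf, *B = buf + 20, *C = buf + 5; // A/C overlap, A/B do not

  bool check_2 = noOverlap(A, B, n);            // all that pass 2 needs
  bool merged  = check_2 && noOverlap(A, C, n); // union check_1+check_2

  // check_2 is true, so pass 2's transformation would have been valid on this
  // input, but the merged guard is false, so the single fast path is skipped
  // and that opportunity is lost.
  std::printf("check_2=%d merged=%d\n", check_2, merged);
  return 0;
}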
----- Original Message -----
> From: "Dibyendu Das" <Dibyendu.Das at amd.com>
> To: "Adam Nemet" <anemet at apple.com>, "Dev" <llvmdev at cs.uiuc.edu>, "Ashutosh Nema" <Ashutosh.Nema at amd.com>, "Hal Finkel" <hfinkel at anl.gov>
> Sent: Saturday, May 23, 2015 5:45:27 AM
> Subject: RE: [LLVMdev] Alias-based Loop Versioning
>
> I am not clear what check_1+check_2 really means. For example, if check_2
> is a subset of check_1 and check_1+check_2 implies union, then the new
> check_1+check_2 may be overly conservative, resulting in lost
> opportunities.

That's correct. However, it is not clear that these opportunities are important in practice. There are other trade-offs here as well (code size, compile time, etc.), and FWIW, in this context code size can also be an important performance consideration, because if you have too many copies of a loop you'll affect inlining, icache locality, etc. We'd really need data here to determine whether or not this is really a good idea.

-Hal
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory