Thanks Adam for listing these points.
In addition to your points, I was thinking to keep this under an option:
we can force memcheck-dependent passes to reuse the memcheck that was
created earlier. But it is conservative, because the common memcheck will be
a superset check (check1+check2+...+checkN).
We may keep this under an option, i.e. ‘-use-common-memcheck’.
Under this option, loop versioning will create a full memcheck by considering
all memory locations, and it will also attach a metadata property to the loop
so it can be identified later. Other memcheck-dependent passes will look for
the metadata property, and if a versioned loop and memcheck have already been
created, they won't create a new memcheck and a new versioned loop.
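The metadata-marking idea can be sketched with a toy model (the `Loop` struct and the `llvm.loop.versioned` marker name below are illustrative assumptions, not LLVM's actual API):

```cpp
#include <set>
#include <string>

// Toy model of the proposal: when loop versioning creates the full
// (superset) memcheck, it attaches a marker to the loop; later
// memcheck-dependent passes test for the marker and reuse the existing
// versioned loop instead of creating a new memcheck and a new clone.
struct Loop {
  std::set<std::string> Metadata; // stand-in for real loop metadata
};

void versionWithCommonMemcheck(Loop &L) {
  // ... emit check1+check2+...+checkN and clone the loop (elided) ...
  L.Metadata.insert("llvm.loop.versioned"); // hypothetical marker name
}

// A later memcheck-dependent pass under -use-common-memcheck.
bool needsNewMemcheck(const Loop &L) {
  return L.Metadata.count("llvm.loop.versioned") == 0;
}
```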
Regards,
Ashutosh
-----Original Message-----
From: Adam Nemet [mailto:anemet at apple.com]
Sent: Friday, May 29, 2015 1:17 AM
To: Nema, Ashutosh
Cc: Hal Finkel; Das, Dibyendu; Dev
Subject: Re: [LLVMdev] Alias-based Loop Versioning
Thanks for the feedback. Sounds like at this point in time we can’t really
settle on a single strategy. We probably want to support all of these
use-cases:
1. A common early loop-versioning pass, probably fairly conservative initially
(e.g. if you need a single memcheck to remove all may-aliasing from a
high-trip-count loop it’s probably a good idea to version). Even if the pass
would not be on by default, it would allow for quick experimentation and
headroom studies.
2. Pass-specific loop-versioning scheduled right before the pass. One example
here could be the LICM-specific versioning pass although currently it’s not
written that way. In this case the versioning decision is made based on the
profitability analysis of the pass. This is where Hal’s idea may come in for
designing passes where the legality/profitability analysis part is split from
the transform part.
3. Versioning within the actual transform pass. Loop vectorization and loop
distribution are the examples here. LV is probably the stronger example to
illustrate why it sometimes makes sense to do this within a pass: besides failed
memchecks, LV also falls back on the original loop if the induction variable may
overflow, so the versioning is further amended by the pass.
Given this I think the way forward is to create a utility class for loop
versioning that we could use from all of these places.
The class will get the initial set of candidate pointers from LoopAccessAnalysis
just like now. It should allow its user to filter memchecks that are required
for the given transformation. After knowing the number of memchecks, the pass
can decide whether to go ahead with the versioning or not. If yes, it will emit
the checks, clone the loop and add the noalias metadata.
Optionally the class could also try to locate a prior versioning of the loop and
amend the checks and the noalias metadata. (This is the idea of my original
post to only version once and keep amending the checking block:
checks_1+checks_2.)
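A rough sketch of such a utility class (interface names are hypothetical, and the LAA pointer checks are modeled with a plain struct rather than the real LoopAccessAnalysis types):

```cpp
#include <functional>
#include <utility>
#include <vector>

// Stand-in for a pointer-pair check that LoopAccessAnalysis would compute.
struct PointerCheck { int PtrA, PtrB; };

class LoopVersioning {
  std::vector<PointerCheck> Checks;
public:
  // Start from the candidate checks provided by LoopAccessAnalysis.
  explicit LoopVersioning(std::vector<PointerCheck> LAAChecks)
      : Checks(std::move(LAAChecks)) {}

  // Let the client pass keep only the checks its transformation requires
  // (e.g. loop distribution drops intra-partition pairs).
  void filterChecks(const std::function<bool(const PointerCheck &)> &Keep) {
    std::vector<PointerCheck> Kept;
    for (const PointerCheck &C : Checks)
      if (Keep(C))
        Kept.push_back(C);
    Checks = std::move(Kept);
  }

  // The client inspects this count to decide whether versioning pays off.
  unsigned getNumChecks() const { return Checks.size(); }

  // Emit the checks, clone the loop, and add the noalias metadata.
  void versionLoop() { /* elided */ }
};
```

The key design point is that the profitability decision stays with the client pass; the class only mechanizes the filtering, emission, cloning, and annotation steps.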
Adam
> On May 24, 2015, at 9:35 PM, Nema, Ashutosh <Ashutosh.Nema at amd.com>
> wrote:
>
> It’s a good thought in general, Adam, but I’m worried about the following
> scenarios:
>
> 1) As Dibyendu already mentioned, Check1 + Check2 is not very clear. If your
> intent is the superset/union of check1 & check2, then I’m not sure it will
> always help passes that need smaller checks (i.e. loop distribution).
>
> Every pass has different runtime-check needs, i.e. the vectorizer checks
> each memory access against all others except (read vs read), while loop
> distribution checks memory in different partitions. Once we create a common
> superset memcheck, how will we ensure there is no loss of opportunity for
> passes that have smaller checks (i.e. loop distribution)?
>
> Consider that the vectorizer generates a memcheck comparing the A, B, C, D
> addresses against each other, while loop distribution generates a memcheck
> for A,B vs C,D. Loop distribution actually doesn’t check memory in the same
> partition. If we rely on the common memcheck (in this case the vectorizer’s
> memcheck) then we will end up checking pointers in the same partition.
>
> 2) Running the legality and profitability of other optimizations in the
> loop versioning pass may not always be helpful, since memcheck-based
> optimizations are scheduled at different places in the pipeline. If we run
> loop versioning early and it in turn checks the profitability of other
> optimizations (i.e. the vectorizer) and generates a versioned loop &
> memcheck, and some later optimization then changes a few instructions which
> makes vectorization illegal, the cost of the generated versioned loop &
> memcheck is not justified.
>
> Thanks,
> Ashutosh
>
>
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Saturday, May 23, 2015 4:53 PM
> To: Das, Dibyendu
> Cc: Adam Nemet; Dev; Nema, Ashutosh
> Subject: Re: [LLVMdev] Alias-based Loop Versioning
>
> ----- Original Message -----
>> From: "Dibyendu Das" <Dibyendu.Das at amd.com>
>> To: "Adam Nemet" <anemet at apple.com>, "Dev" <llvmdev at cs.uiuc.edu>,
>> "Ashutosh Nema" <Ashutosh.Nema at amd.com>, "Hal Finkel"
>> <hfinkel at anl.gov>
>> Sent: Saturday, May 23, 2015 5:45:27 AM
>> Subject: RE: [LLVMdev] Alias-based Loop Versioning
>>
>> I am not clear what check_1+check_2 really means. For ex: if check_2
>> is a subset of check_1 and check_1+check_2 implies union, then the
>> new check_1+check_2 may be overly conservative resulting in lost
>> opportunities.
>>
>
> That's correct. However, it is not clear that these opportunities are
> important in practice. There are other trade-offs here as well (code size,
> compile time, etc.), and FWIW, in this context code size can also be an
> important performance consideration, because if you have too many copies of
> the loop you'll affect inlining, icache locality, etc.
>
> We'd really need data here to determine whether or not this is really a
> good idea.
>
> -Hal
>
>>
>> From: llvmdev-bounces at cs.uiuc.edu
>> [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Adam Nemet
>> Sent: Thursday, May 21, 2015 10:23 PM
>> To: Dev; Nema, Ashutosh; Hal Finkel
>> Subject: [LLVMdev] Alias-based Loop Versioning
>>
>> There is work taking place by multiple people in this area, and more is
>> expected to happen, so I’d like to make sure we’re working toward a common
>> end goal.
>>
>> I tried to collect the use-cases for run-time memory checks and the
>> specific memchecks required for each:
>>
>> 1. Loop Vectorizer: each memory access is checked against all other
>> memory accesses in the loop (except read vs read)
>>
>> 2. Loop Distribution: only memory accesses in different partitions
>> are checked against each other. The loop vectorizer will add its own
>> checks for the vectorizable distributed loops
>>
>> 3. Loop Fusion: accesses from the to-be-merged loops should be
>> checked against each other
>>
>> 4. LICM: if hoisting a load, stores need to be checked. When sinking
>> a store, all accesses are checked
>>
>> 5. Load-elimination in GVN: all *intervening* stores need to be
>> checked.
>>
>> 6. Instruction scheduling for in-order targets: same as 5
>>
>> Currently only the first two are implemented. Ashutosh has a pending
>> review for the LICM/independent pass. I am also working on loop-aware
>> load-elimination requiring memchecks (5 from the above list).
>>
>> The two main approaches are whether to do this in a separate pass or to
>> do it locally in the passes that benefit from versioning. I tried to
>> collect the pros and cons of each.
>>
>> 1. Separate pass
>>
>> The benefit of this approach is that current passes would not have to be
>> modified to take advantage of the new freedom due to the increased
>> independence of the memory access operations. AA will capture the
>> noalias annotations inserted by this pass and present them to the
>> passes.
>>
>> Memchecks present an overhead at run time, so one question is how we
>> ensure that any optimization will amortize the cost of these checks.
>>
>> Depending on the optimization, answering this could be pretty involved
>> (i.e. almost like running the pass itself). Consider the loop
>> vectorizer: in order to answer whether versioning the loop would make
>> it vectorizable, you’d have to run most of the legality and
>> profitability logic from the pass.
>>
>> Also, which accesses do we need to check? Depending on the optimization
>> we may not need to check each access against every other access, which,
>> being quadratic, can be a significant difference.
>>
>> 2. Each pass performs its own versioning
>>
>> Under this approach, each pass would make the calculation locally
>> whether the benefit of versioning outweighs the overhead of the checks.
>> The current Loop Vectorizer is a good example of this: it effectively
>> assumes no may-alias, and if the number of required checks is under a
>> certain threshold it assumes that the vectorization gain will outweigh
>> the cost of the checks.
>>
>> Making the decision locally is not ideal in this approach, though: if
>> we can amortize the cost of the same checks with a combination of
>> optimizations from *multiple* passes, neither pass would make the
>> decision locally to version.
>>
>> Also, it’s probably beneficial to perform a single loop versioning
>> even if multiple passes would like to check different accesses. E.g.
>> rather than:
>>
>>            Checks_1
>>             /    \
>>            /      \
>>      OrigLoop    Checks_2
>>          \        /    \
>>           \      /      \
>>            \  NoAlias_1  NoAlias_2
>>             \     |     /
>>              \    |    /
>>               \   |   /
>>                 Join
>>
>> But instead:
>>
>>       Checks_1+Check_2
>>            /    \
>>           /      \
>>      OrigLoop   NoAlias_2
>>           \      /
>>            \    /
>>             \  /
>>              Join
>>
>> This is effectively creating a fast-path and a slow-path version of
>> the loop. We would probably need some metadata annotation so that
>> subsequent passes could amend the same checking block.
>>
>> 3. There are some more futuristic ideas, like always versioning fully,
>> disambiguating all accesses, and then having the optimizers transform
>> both versions of the loop. A later pass would decide which checks were
>> necessary and what additional optimizations were done in the speculative
>> version of the loop, and then finally make the cost decision of whether
>> to keep the speculative version along with the checks or remove them.
>> This would probably need a fairly sophisticated set of metadata.
>>
>> I think that a combination of 1 and 2 makes sense. This is where we
>> will effectively end up after Ashutosh’s patch. This would hopefully
>> give us the best of both worlds: the aggressiveness/precision of
>> making the call locally in complex passes and the option of
>> simplicity/orthogonality of the separate pass.
>>
>> Thoughts?
>>
>> Adam
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory