I mistakenly not cc’ed the list.
On Oct 23, 2013, at 8:28 AM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 23 October 2013 14:13, Arnold <aschwaighofer at apple.com> wrote:
>> What I am proposing for interleaved data vectorization would not
transform the loop before vectorization (because there is not much you can do in
the general case). There would not be a need for any retries.
>
> I know, I digressed.
>
>
>> We want to run the loop vectorizer late so we can't rely on a other
passes to undo the canonicalisation we have done.
>
> I didn't mean to change the code and annotate, but to just annotate the
unchanged blocks in case we want to try again, we can avoid re-doing the same
analysis.
>
> I've noticed, on the debug-only, that the loop vectorizer passes twice
on the same loops, and we even check for validity of previously auto-vectorized
loops! This is already a big waste, and if there was a way to avoid that, we
could use the saved time for more complex (maybe iterative, speculative)
analysis without increasing the compilation time.
>
Iteresting. This is a bug. Since we moved the loop vectorizer out of the
scc-module pass manager we should only run once on every new loop. The
vectorized loop is a new loop so runOnLoop will probably be called again. But
the Hint we put in the IR should prevent us from analyzing such loops.
I think what you might be seeing is a bug in LoopVectorizerHints:
/// Mark the loop L as already vectorized by setting the width to 1.
void setAlreadyVectorized(Loop *L)
We should set width *and* unroll to one. Because this is how we use it:
LoopVectorizeHints Hints(L, DisableUnrolling);
if (Hints.Width == 1 && Hints.Unroll == 1) {
DEBUG(dbgs() << "LV: Not vectorizing.\n");
return false;
}
After doing this no vectorized loop should be analyzed again. If not I very much
like to know.
http://llvm.org/bugs/show_bug.cgi?id=17662
Thanks,
Arnold
> This is a completely different subject to what we were talking about, that
was a bit of a handbrake turn, sorry about that. ;)
>
> cheers,
> --renato
On Oct 23, 2013, at 8:13 AM, Arnold <aschwaighofer at apple.com> wrote:
> Hi Renato,
>
>
> On Oct 23, 2013, at 5:40 AM, Renato Golin <renato.golin at
linaro.org> wrote:
>
>> On 22 October 2013 17:31, Arnold Schwaighofer <aschwaighofer at
apple.com>wrote:
>> If you wanted to tackle the general problem of vectorization of
interleaved data I am not sure that much of Hal’s code can be immediately reused
for the problem.
>>
>> Hi Arnold,
>>
>> I agree with you, but some of the cases where a stride would work,
Hal's patch would re-roll, and if I base my analysis on what was there
before, I won't detect it.
>>
> You would detect it doing interleaved data in the vectorizer and generate
exactly the code that GCC does.
>
> Which granted will be less efficient than if we reroll and vectorize in
which case we should see a speed up of 8 (8 x i8).
>
> I have not looked at whether Hal's code could be improved to handle
your uncanonical example.
>
>> Since Hal hinted on leaving that as an API for the vectorizer, I
thought that maybe it'd be worth running that first, and delegating the
vectorization to other, simpler, parts of the vectorizer.
> Yes we could run his patch before the vectorizers because otherwise in the
current pipeline the code would not survive the slp vectorizer.
> This is a phase ordering issue.
>>
>> As an example, I created two loops:
>>
>> One, unrolled, with stride access:
>> for (i=0; i<MAX; i++) {
>> a[i] = OFFSET - b[i] - DELTA;
>> a[i+1] = OFFSET - b[i+1] - DELTA;
>> a[i+2] = OFFSET - b[i+2] - DELTA;
>> }
>>
>> One re-rolled, as it would, with Hal's patch:
>> for (i=0; i<MAX*3; i++) {
>> a[i] = OFFSET - b[i] - DELTA;
>> }
>>
>> The first loop doesn't get vectorized, because there are too many
PHIs, but the second does.
>>
>> So, Hal's patch will deal with the most basic strided access, so
that we can focus on the more complex patterns.
>>
>> There is one slight issue with all this (APIs, and re-tries), and
it's that there are already too much analysis going on in the vectorizer,
and you don't want to run the canVectorize() too many times, so calling
re-roll on failure and trying again, then calling stride-transform and trying
again on loops that would never benefit from those treatments might not be a
good idea.
>
> What I am proposing for interleaved data vectorization would not transform
the loop before vectorization (because there is not much you can do in the
general case). There would not be a need for any retries.
>
> You can do all the analysis using SCEVS and the transforms in the
vectorizer by treating strided accesses specially.
>
> The example of transforming the loop to a simpler form I gave in our
discussion was as an illustration. I did say "virtually" we make the
loop as if it look like the simpler /canonical form. We already to this with
unit stride pointer loops.
>>
>> To deal with this, I thought we could have a table with a list of
problems and how to solve them, from cheaper to more expensive analysis. So, say
I found a loop with strided access, and the general error from a more basic
analysis is to say that we found an "unidentified PHI". We then mark
the loop with some metadata to that effect, and the next iteration, it'll
try to transform the loop before vectorizing in a way that will increase the
chances of it getting vectorized.
>>
>> Moreover, such a table (and associated metadata), could also be an
easier way to annotate vectorization failures, and even used by tools to help
developers fix their code.
>>
>>
>
>
> I think I would not want to modify IR speculatively for analysis to get a
canonical form that we can handle.
> And it is also not necessary for the problem at hand.
> SCEV already provides us with a canonical view of the loop.
>
>
> I agree that we have to do something about the complexity we have in the
loop vectorizer before we add any more significant body of code to it.
>
> I am not convinced that iteratively speculatively cannonicalizing the loop
is the right answer.
>
> We want to run the loop vectorizer late so we can't rely on a other
passes to undo the canonicalisation we have done.
>
> Thanks,
> Arnold
>
>>
>> You want to split the set of load and store instructions with non-unit
stride accesses into groups where each group contains useful locality (the
members of a group are adjacent).
>>
>> Thanks again for the how-to, they're helping me get up to speed
with the vectorizer again.
>>
>> cheers,
>> --renato