Renato Golin via llvm-dev
2016-Nov-25 13:39 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
Hi Graham,

I'll look into the patches next, but first some questions after reading the
available white papers on the net.

On 24 November 2016 at 15:39, Graham Hunter <Graham.Hunter at arm.com> wrote:
> This complex constant represents the runtime value of `n` for any scalable type
> `<n x m x ty>`. This is primarily used to increment induction variables and
> generate offsets.

What do you mean by "complex constant"? Surely not Complex, but this is not
really a constant either.

From what I read around (and this is why releasing the spec is important,
because I'm basing my reviews on guesswork), the length of a vector is not
constant, even on the same machine.

In theory, according to a post in the ARM forums (which I now forget), the
kernel could choose the vector length per process, meaning this is not known
even at link time.

But that's ok, because the SVE instructions completely (I'm guessing, again)
bypass the need for that "constant" to be constant at all, ie. the use of
`incw/incp`. Since you can fail half-way through, the width by which you need
to increment the induction variable is not even known at run time! Meaning,
that's not a constant at all!

Example: a[i] = b[ c[i] ];

  ld1w z0.s, p0/z, [ c, i, lsl 2 ]
  ld1w z1.s, p0/z, [ b, z0.s, sxtw 2 ]

Now, the z0.s load may have failed with a seg fault somewhere, and it's up to
the FFR to tell brka/brkb how to deal with this.

Each iteration will have:
 * The same vector length *per process* for accessing c[]
 * A potentially *different* vector length, *per iteration*, for accessing b[]

So, while <n x m x i32> could be constant on some vectors, even at compile
time (if we have a flag that forces a certain length), it could be unknown
*per iteration* at run time.

> ```llvm
> %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
> ```

Right, this would be translated to:

  incw x2

Now, the question is, why do we need "mul (i64 vscale, i64 4)" in the IR?
There is no semantic analysis you can do on a value that can change on every
iteration of the loop. You can't elide, hoist, combine or const fold.

If I got it right (from random documents on the web), `incX` relates to a
family of "increment induction variable" instructions. `incw` is probably
"increment by words", ie. 32-bit lanes, while `incp` is "increment by
predicate", ie. by whatever the size of the predicate you use.

Examples:
  incw x2          # increments x2 by 4*(FFR successful lanes)
  incp x2, p0.b    # increments x2 by 1*(FFR successful lanes)

So, this IR semantics is valid for the first case, but irrelevant for the
second. Also, I'm worried that we'll end up ignoring the multiplier
altogether if we change the vector types (from byte to word, for example),
or make the process of doing so more complex.

> The following shows the construction of a scalable vector of the form
> <start, start-2, start-4, ...>:
>
> ```llvm
> %elt = insertelement <n x 4 x i32> undef, i32 %start, i32 0
> %widestart = shufflevector <n x 4 x i32> %elt, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
> %step = insertelement <n x 4 x i32> undef, i32 -2, i32 0
> %widestep = shufflevector <n x 4 x i32> %step, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
> %stridevec = mul <n x 4 x i32> stepvector, %widestep
> %finalvec = add <n x 4 x i32> %widestart, %stridevec
> ```

This is really fragile and confusing, and I agree with James: an intrinsic
here would be *much* better. Something like:

  %const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)

cheers,
--renato
Paul Walker via llvm-dev
2016-Nov-26 11:49 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
Hi Renato,

There are more comments inline below, but I wanted to highlight a couple of
things. Firstly, we are not in a position to release the SVE specification
at this time; we'll send a message when it's publicly available.

Related to this, I want to push this and related conversations in a
different direction. From the outset, our approach to adding SVE support to
LLVM IR has been about solving the generic problem of vectorising for an
unknown vector length and then extending this to support predication. With
this in mind, I would rather the problem and its solution be discussed at
the IR's level of abstraction rather than getting into the guts of SVE.

I suggest this because trying to understand the nuances of mapping IR types
to SVE registers, first-faulting loads and predicate partitioning
instructions without having access to the full specification is going to be
painful, lead to confusion and is ultimately unnecessary. Your example is
potentially more complex than what we'll be working towards in the short to
medium term. So I apologise if some of my responses seem a little
dismissive, but I'm keen to keep us on point whilst we work towards getting
SVE support for the simple vectorisation cases upstream.

Paul!!!

> On 25/11/2016, 13:39, "Renato Golin" <renato.golin at linaro.org> wrote:
>
> On 24 November 2016 at 15:39, Graham Hunter <Graham.Hunter at arm.com> wrote:
> > This complex constant represents the runtime value of `n` for any scalable type
> > `<n x m x ty>`. This is primarily used to increment induction variables and
> > generate offsets.
>
> What do you mean by "complex constant"? Surely not Complex, but this
> is not really a constant either.

"complex constant" is the term used within the LangRef.
Although its value can be different across certain interfaces, this does not
need to be modelled within the IR, and thus for all intents and purposes we
can safely consider it to be constant.

> From what I read around (and this is why releasing the spec is important,
> because I'm basing my reviews on guesswork), the length of a vector is
> not constant, even on the same machine.
>
> In theory, according to a post in the ARM forums (which I now forget),
> the kernel could choose the vector length per process, meaning this is
> not known even at link time.
>
> But that's ok, because the SVE instructions completely (I'm guessing,
> again) bypass the need for that "constant" to be constant at all, ie.
> the use of `incw/incp`. Since you can fail half-way through, the width
> by which you need to increment the induction variable is not even known
> at run time! Meaning, that's not a constant at all!
>
> Example: a[i] = b[ c[i] ];
>   ld1w z0.s, p0/z, [ c, i, lsl 2 ]
>   ld1w z1.s, p0/z, [ b, z0.s, sxtw 2 ]

This is not how speculation is handled within SVE. This is not the context
to dig into this subject, so perhaps we can start a separate thread. I ask
this because speculation within the vectoriser is independent of scalable
vectors.

> Now, the z0.s load may have failed with a seg fault somewhere, and it's
> up to the FFR to tell brka/brkb how to deal with this.
>
> Each iteration will have:
>  * The same vector length *per process* for accessing c[]
>  * A potentially *different* vector length, *per iteration*, for accessing b[]
>
> So, while <n x m x i32> could be constant on some vectors, even at
> compile time (if we have a flag that forces a certain length), it could
> be unknown *per iteration* at run time.

I am not sure what point you are trying to make. I agree that when doing
speculative loads, the induction variable update is potentially different
per iteration, being based on the result of the speculative load.
"vscale" is not trying to represent the result of such speculation. It's purely a constant runtime vector length multiplier. Such a value is required by LoopVectorize to update induction variables as describe below plus simple interactions like extracting the last element of a scalable vector. On a related note don't directly link "<n x m x ty>" types to SVE registers. Although some will map directly we do adopt a similar approach as for non-scalable vectors in that within IR you can represent scalable vectors that are large/smaller than those directly supported by the target.> > ```llvm > > %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4) > > ``` > > Right, this would be translated to: > incw x2 > > Now, the question is, why do we need "mul (i64 vscale, i64 4)" in the IR?The answer is because that is how LoopVectorize updates its induction values. For non-scalable vectors you would see: %index.next = add nuw nsw i64 %index, i64 4 for a VF of 4. Why wouldn't you want to see: %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4) for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")> There is no semantic analysis you can do on a value that can change on > every iteration of the loop. You can't elide, hoist, combine or const > fold. > > If I got it right (from random documents on the web), `incX` relates > to a number of "increment induction" functionality. `incw` is probably > "increment W", ie. 32-bits, while `incp` is "increment predicate", ie. > whatever the size of the predicate you use: > > Examples: > incw x2 # increments x2 to 4*(FFR successful lanes) > incp x2, p0.b # increments x2 to 1*(FFR successful lanes) > > So, this IR semantics is valid for the second case, but irrelevant for > the second. 
Also, I'm worried that we'll end up ignoring the > multiplier altogether, if we change the vector types (from byte to > word, for example), or make the process of doing so more complex.As mentioned above I'd rather not describe the details of SVE instructions at this time because it'll only distract from the generic IR representation we are aiming for.> > The following shows the construction of a scalable vector of the form > > <start, start-2, start-4, ...>: > > > > ```llvm > > %elt = insertelement <n x 4 x i32> undef, i32 %start, i32 0 > > %widestart = shufflevector <n x 4 x i32> %elt, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer > > %step = insertelement <n x 4 x i32> undef, i32 -2, i32 0 > > %widestep = shufflevector <n x 4 x i32> %step, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer > > %stridevec = mul <n x 4 x i32> stepvector, %widestep > > %finalvec = add <n x 4 x i32> %widestart, %stridevec > > ``` > > This is really fragile and confusing, and I agree with James, an > intrinsic here would be *much* better. > > Something like > > %const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)This intrinsic matches the seriesvector instruction we original proposed. However, on reflection we didn't like how it allowed multiple representations for the same constant. Instead we prefer "stepvector" to better allow a single canonical form for scalable vectors. I know this doesn't preclude the use of an intrinsic, I just wanted to highlight that doing so doesn't automatically change the surrounding IR. We wonder if this canonical form is worth explicitly using across all vector types to maintain a single code path (e.g. GEP related IR matching for strided access patterns) and to allow a prettier textual IR (e.g. a non-scalable 1024bit vector of bytes means a 128 entry constant vector to type).> cheers, > --renato
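[Editorial aside: to make the induction-variable discussion in this exchange concrete, here is a minimal sketch of a scalable vectorised loop as LoopVectorize might emit it under this proposal, doubling each element of an i32 array. The `<n x 4 x i32>` type, `vscale` and the constant-expression multiply follow the RFC; everything else (block and value names, the unpredicated loads/stores, the assumption that the trip count is a multiple of the VF) is illustrative guesswork, not released LLVM IR.]

```llvm
vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  ; each iteration processes vscale * 4 lanes of a[]
  %addr = getelementptr i32, i32* %a, i64 %index
  %vptr = bitcast i32* %addr to <n x 4 x i32>*
  %va = load <n x 4 x i32>, <n x 4 x i32>* %vptr
  %vb = add <n x 4 x i32> %va, %va
  store <n x 4 x i32> %vb, <n x 4 x i32>* %vptr
  ; step the induction variable by the runtime vector length, VF = vscale * 4;
  ; for a fixed-width VF of 4 this would simply be "add ... %index, 4"
  %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
  %cmp = icmp ult i64 %index.next, %trip.count
  br i1 %cmp, label %vector.body, label %exit
```

This is the shape behind Paul's "why wouldn't you want to see" question: the update is the same add-a-constant pattern LoopVectorize already emits, with `vscale` folded into the constant expression.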
Renato Golin via llvm-dev
2016-Nov-26 17:07 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 26 November 2016 at 11:49, Paul Walker <Paul.Walker at arm.com> wrote:
> Related to this I want to push this and related conversations in a
> different direction. From the outset our approach to add SVE support to
> LLVM IR has been about solving the generic problem of vectorising for an
> unknown vector length and then extending this to support predication.
> With this in mind I would rather the problem and its solution be
> discussed at the IR's level of abstraction rather than getting into the
> guts of SVE.

Hi Paul,

How scalable vectors operate is intimately related to how you represent them
in IR. It took a long time for the vector types to be mapped to all
available semantics. We still had to use a bunch of intrinsics for
scatter/gather, and it took years to get strided access settled.

I understand that scalable vectors are orthogonal to all this, but as a new
concept, one that isn't available in any open source compiler I know of, it
will likely be very vague. Not publishing the specs only makes it worse.

I take the example of the ACLE and ARMv8.2 patches that ARM has been pushing
upstream. I have no idea what the new additions are, so I have to take your
word that they're correct. But later on, different behaviour comes along for
the same features with a comment like "it didn't work that way, let's try
this". Sometimes I don't even know what failed, or why the new thing is
better.

When that behaviour is confined to the ARM back-end, it's ok. It's a burden
that Tim and I will have to carry, and so far it has been a small burden.
But exposing the guts of the vectorizers (which are already getting to the
point where they need large refactorings), which will affect all targets,
needs a bit more concrete information. The last thing we want is to keep
changing how the vectorizer behaves every six months without any concrete
information as to why.
I also understand that LLVM is great at prototyping, and that's an important
step for companies like ARM to make sure their features work as reliably as
they expect in the wild, but I think adding new IR semantics and completely
refactoring core LLVM passes without a clue is a few steps too far.

I'm not asking for a full spec. All I'm asking for is a description of the
intended basic functionality: addressing modes, how to extract information
from unknown lanes, or whether all reduction steps will be done like
`saddv`. Without that information, I cannot know what the best IR
representation for scalable vectors is, or what the semantics of
shuffle/extract/insert operations will be.

> "complex constant" is the term used within the LangRef. Although its
> value can be different across certain interfaces this does not need to
> be modelled within the IR and thus for all intents and purposes we can
> safely consider it to be constant.

From the LangRef:

"Complex constants are a (potentially recursive) combination of simple
constants and smaller complex constants."

There's nothing there saying it doesn't need to be modeled in IR.

> "vscale" is not trying to represent the result of such speculation. It's
> purely a constant runtime vector length multiplier. Such a value is
> required by LoopVectorize to update induction variables as described
> below plus simple interactions like extracting the last element of a
> scalable vector.

Right, I'm beginning to see what you mean...

The vectorizer needs that to be a constant at compile time to make safety
assurances. For instance:

  for (1..N) { a[i+3] = a[i] + i; }

has a max VF of 3. If the vectorizer is to act on that loop, it'll have to
change "vscale" to 3. If there are no loop dependencies, then you leave it
as "vscale" and vectorize anyway.
Other assurances are done for run-time constants, for instance tail loops,
when changing:

  for (i=0; i<N; i++)  ->  for (i=0; i<N; i+=VF)

That VF is now a run-time "constant", and the vectorizer needs to see it as
such, otherwise it can't even test for validity.

So, the vectorizer will need to be taught two things:

1. "vscale" is a run-time constant, and for the purpose of validity, can be
   shrunk to any value down to two. If the value is shrunk, the new
   compile-time constant replaces vscale.

2. The cost model will *have* to treat "vscale" as an actual compile-time
   constant. This could come from a target feature, overridden by a command
   line flag, but there has to be a default, which I'd assume is 4, given
   that it's the lowest length.

> %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
>
> for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")

I see what you mean.

Quick question: since you're saying "vscale" is an unknown constant, why not
just:

  %index.next = add nuw nsw i64 %index, i64 vscale

All scalable operations will be tied up by the predication vector anyway,
and you already know what the vector type size is. The only worry is about
providing redundant information that could go stale and introduce bugs.

I'm assuming the vectorizer will *have* to learn about the compulsory
predication and build those vectors, or the back-end will have to handle
them, and it can get ugly.

> > %const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)
>
> This intrinsic matches the seriesvector instruction we originally
> proposed. However, on reflection we didn't like how it allowed multiple
> representations for the same constant.

Can you expand on how this allows multiple representations for the same
constant? This is a series, with a start and a step, and will only be
identical to another which has the same start and step. Just like C
constants can "appear" different:

  const int foo = 4;
  const int bar = foo;
  const int baz = 2 + 2;

> I know this doesn't preclude the use of an intrinsic, I just wanted to
> highlight that doing so doesn't automatically change the surrounding IR.

I don't mind IR changes, I'm just trying to understand the need for them.
Normally, what we did in the past for some things was to add intrinsics and
then, if it was clear a native IR construct would be better, change it.

At least an intrinsic can be easily added without breaking compatibility
with anything, and since we're in the prototyping phase anyway, changing the
IR would be the worst idea.

cheers,
--renato
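[Editorial aside: Renato's "multiple representations" question can be illustrated from Paul's own stated rationale. Under the earlier seriesvector proposal, the same constant could be spelled either directly with a start and step, or by scaling the canonical stepvector; the syntax below is a guess at that pre-thread proposal, not released LLVM IR.]

```llvm
; Two spellings of the same constant <0, 2, 4, 6, ...>:

; (a) directly, with a start of 0 and a step of 2:
%v1 = seriesvector i32 0, i32 2

; (b) as 2 * <0, 1, 2, 3, ...>, built from the canonical stepvector:
%two   = insertelement <n x 4 x i32> undef, i32 2, i32 0
%splat = shufflevector <n x 4 x i32> %two, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
%v2    = mul <n x 4 x i32> stepvector, %splat   ; same value as %v1
```

With stepvector alone, only spelling (b) exists, which is what gives passes a single canonical form to match.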