Renato Golin via llvm-dev
2016-Nov-26 17:07 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 26 November 2016 at 11:49, Paul Walker <Paul.Walker at arm.com> wrote:
> Related to this I want to push this and related conversations in a different direction. From the outset our approach to add SVE support to LLVM IR has been about solving the generic problem of vectorising for an unknown vector length and then extending this to support predication. With this in mind I would rather the problem and its solution be discussed at the IR's level of abstraction rather than getting into the guts of SVE.

Hi Paul,

How scalable vectors operate is intimately related to how you represent them in IR. It took a long time for the vector types to be mapped to all available semantics. We still had to use a bunch of intrinsics for scatter/gather, and it took years to get strided access settled.

I understand that scalable vectors are orthogonal to all this, but as a new concept, one that isn't available in any open source compiler I know of, it is one that will likely be very vague. Not publishing the specs only makes it worse.

I take the example of the ACLE and ARMv8.2 patches that ARM has been pushing upstream. I have no idea what the new additions are, so I have to take your word that they're correct. But later on, different behaviour comes along for the same features with a comment like "it didn't work that way, let's try this". Sometimes I don't even know what failed, or why the new thing is better.

When that behaviour is constrained to the ARM back-end, it's OK. It's a burden that Tim and I will have to carry, and so far it has been a small burden. But exposing the guts of the vectorizers (which are already getting to a point where they need large refactorings), which will affect all targets, needs a bit more concrete information.

The last thing we want is to keep changing how the vectorizer behaves every six months without any concrete information as to why.
I also understand that LLVM is great at prototyping, and that's an important step for companies like ARM to make sure their features work as reliably as they expect in the wild, but I think adding new IR semantics and completely refactoring core LLVM passes without a clue is a few steps too far.

I'm not asking for a full spec. All I'm asking for is a description of the intended basic functionality: addressing modes, how to extract information from unknown lanes, or whether all reduction steps will be done like `saddv`. Without that information, I cannot know what the best IR representation for scalable vectors is, or what the semantics of shuffle / extract / insert operations will be.

> "complex constant" is the term used within the LangRef. Although its value can be different across certain interfaces this does not need to be modelled within the IR and thus for all intents and purposes we can safely consider it to be constant.

From the LangRef:

"Complex constants are a (potentially recursive) combination of simple constants and smaller complex constants."

There's nothing there saying it doesn't need to be modelled in IR.

> "vscale" is not trying to represent the result of such speculation. It's purely a constant runtime vector length multiplier. Such a value is required by LoopVectorize to update induction variables as described below plus simple interactions like extracting the last element of a scalable vector.

Right, I'm beginning to see what you mean...

The vectorizer needs that to be a constant at compile time to make safety assurances. For instance:

    for (1..N) { a[i+3] = a[i] + i; }

has a maximum VF of 3, because of the loop-carried dependence of distance 3. If the vectorizer is to act on that loop, it will have to change "vscale" to 3. If there are no loop dependences, then you leave it as "vscale" but vectorize anyway.
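[Editor's sketch, not part of the original mail: a hypothetical simulation of why the dependence distance caps the VF. Each simulated "vector iteration" loads all of its lanes before storing any, which is what a real vector load/store pair does, and exactly what makes a VF larger than the dependence distance read stale values in this loop. The function names are purely illustrative.]

```python
# The loop a[i+3] = a[i] + i has a loop-carried dependence of distance 3,
# so a vector factor (VF) above 3 reads elements before they are updated.
def scalar(a):
    for i in range(len(a) - 3):
        a[i + 3] = a[i] + i
    return a

def vectorized(a, vf):
    # One vector iteration: load all VF inputs first, then store all VF
    # outputs, mimicking a vector load/store pair (with a partial tail).
    n = len(a) - 3
    for i in range(0, n, vf):
        lanes = min(vf, n - i)
        tmp = [a[i + l] + (i + l) for l in range(lanes)]  # vector load + add
        for l in range(lanes):
            a[i + 3 + l] = tmp[l]                         # vector store
    return a

ref = scalar(list(range(12)))
print(vectorized(list(range(12)), 3) == ref)  # -> True: VF <= distance is safe
print(vectorized(list(range(12)), 4) == ref)  # -> False: VF > distance is wrong
```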
Other assurances are done for run-time constants, for instance the tail loop created when changing

    for (i=0; i<N; i++)  ->  for (i=0; i<N; i+=VF)

That VF is now a run-time "constant", and the vectorizer needs to see it as such, otherwise it can't even test for validity.

So the vectorizer will need to be taught two things:

1. "vscale" is a run-time constant, and for the purpose of validity it can be shrunk to any value down to two. If the value is shrunk, the new compile-time constant replaces vscale.

2. The cost model will *have* to treat "vscale" as an actual compile-time constant. This could come from a target feature, overridden by a command-line flag, but there has to be a default, which I'd assume is 4, given that it's the lowest length.

> %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
>
> for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")

I see what you mean.

Quick question: since you're saying "vscale" is an unknown constant, why not just:

    %index.next = add nuw nsw i64 %index, i64 vscale

All scalable operations will be tied up by the predication vector anyway, and you already know what the vector type size is. The only worry is about providing redundant information that could go stale and introduce bugs.

I'm assuming the vectorizer will *have* to learn about the compulsory predication and build those vectors, or the back-end will have to handle them, and it can get ugly.

>> %const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)
>
> This intrinsic matches the seriesvector instruction we originally proposed. However, on reflection we didn't like how it allowed multiple representations for the same constant.

Can you expand on how this allows multiple representations for the same constant? This is a series, with a start and a step, and will only be identical to another which has the same start and step.

Just like C constants can "appear" different...
    const int foo = 4;
    const int bar = foo;
    const int baz = 2 + 2;

> I know this doesn't preclude the use of an intrinsic, I just wanted to highlight that doing so doesn't automatically change the surrounding IR.

I don't mind IR changes, I'm just trying to understand the need for them.

Normally, what we did in the past for some things was to add intrinsics and then, if it became clear a native IR construct would be better, change it. At least an intrinsic can be easily added without breaking compatibility with anything, and since we're in the prototyping phase anyway, changing the IR would be the worst idea.

cheers,
--renato
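[Editor's sketch, not part of the original mail: a hypothetical Python model of the series vector being debated. `seriesvector` here is a stand-in for the proposed construct, not a real LLVM API. A series is pinned down by its (start, step) pair, as Renato argues, yet arithmetic can produce the same concrete series from different-looking expressions, which is presumably the ambiguity Paul alludes to.]

```python
# Model of the proposed series vector: a (start, step) pair expanded over
# the available lanes.  Names are illustrative only.
def seriesvector(start, step, lanes):
    return [start + i * step for i in range(lanes)]

# Structurally identical (start, step) pairs give the same series...
assert seriesvector(0, 2, 8) == seriesvector(0, 2, 8)

# ...but arithmetic on a series yields the same concrete values from a
# different construction, i.e. one constant, several representations:
doubled = [2 * x for x in seriesvector(0, 1, 8)]
print(doubled == seriesvector(0, 2, 8))  # -> True
```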
Eric Christopher via llvm-dev
2016-Nov-26 20:40 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On Sat, Nov 26, 2016 at 9:07 AM Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [...]
>
> I don't mind IR changes, I'm just trying to understand the need for it.
>
> Normally, what we did in the past for some things was to add intrinsics and then, if it's clear a native IR construct would be better, we change it.
>
> At least the intrinsic can be easily added without breaking compatibility with anything, and since we're in prototyping phase anyway, changing the IR would be the worst idea.

These last 3 paragraphs are a great summary of my position on this as well.

Thanks!

-eric

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Paul Walker via llvm-dev
2016-Nov-27 13:59 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
Thanks Renato, my takeaway is that I am presenting the design out of order. So let's focus purely on the vector length (VL) and ignore everything else. For SVE the vector length is unknown and can vary across an as yet undetermined boundary (process, library...). Within a boundary we propose making VL a constant, with all instructions that operate on this constant locked within its boundary.

I know this stretches the meaning of constant, and my reasoning (however unsound) is below. We expect changes to VL to be infrequent and not located where they would present an unnecessary barrier to optimisation. With this in mind, the initial implementation of VL barriers would be an intrinsic that prevents any instruction movement across it.

Question: Is this type of intrinsic something LLVM supports today?

Why a constant? Well, it doesn't change within the context it is being used. More crucially, the LLVM implementation of constants gives us a property that's very important to SVE (perhaps this is where prototyping laziness has kicked in): constants remain attached to the instructions that operate on them through to code generation. This allows the semantic meaning of these instructions to be maintained, something non-scalable vectors get for free with their "real" constants.

As a specific example, take the vector reversal that LoopVectorize does when iterating backward through memory. For non-scalable vectors this looks like:

    shufflevector <4 x i32> %a, <4 x i32> undef, <i32 3, i32 2, i32 1, i32 0>

Throughout the IR and into code generation the intention of this instruction is clear. Now turning to scalable vectors, the same operation becomes:

    shufflevector <n x 4 x i32> %a, <n x 4 x i32> undef, <n x 4 x i32> seriesvector (sub (i32 VL, 1), i32 -1)

Firstly I'll highlight that the use of seriesvector here is purely for brevity; let's ignore that debate for now.
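[Editor's sketch, not part of the original mail: for any runtime VL, that mask is just lanes VL-1 down to 0. A tiny Python model (illustrative only; `seriesvector` and `shufflevector` are stand-ins for the IR constructs) shows that the "reverse" intent is independent of the actual VL, which is what Paul wants preserved through to code generation.]

```python
# The reverse mask seriesvector(VL - 1, -1) selects lanes VL-1 .. 0,
# keeping the "reverse the vector" intent explicit for any runtime VL.
def seriesvector(start, step, lanes):
    return [start + i * step for i in range(lanes)]

def shufflevector(vec, mask):
    return [vec[m] for m in mask]

for vscale in (1, 2, 4):          # VL = vscale * 4 for <n x 4 x i32>
    vl = vscale * 4
    v = list(range(vl))
    mask = seriesvector(vl - 1, -1, vl)
    assert shufflevector(v, mask) == list(reversed(v))
print("reverse intent preserved for VL = 4, 8, 16")
```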
Our concern is that not treating VL as a Constant means sub and seriesvector are no longer constant and are likely to be hoisted away from the shufflevector. The knock-on effect is to force the code generator into generating generic vector permutes rather than utilising any specialised permute instructions the target provides.

Does this make sense? I am not after agreement, I just want to make sure we are on the same page regarding our aims before digging down into how VL actually looks and its interaction with the loop vectoriser's chosen VF.

Paul!!!

p.s. I'll respond to the stepvector question later in a separate post to break down the different discussion points.

On 26/11/2016, 17:07, "Renato Golin" <renato.golin at linaro.org> wrote:
> [...]
Amara Emerson via llvm-dev
2016-Nov-27 15:32 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
> I'm not asking for a full spec. All I'm asking is for a description of
> the intended basic functionality. Addressing modes, how to extract
> information from unknown lanes, or if all reduction steps will be done
> like `saddv`. Without that information, I cannot know what is the best
> IR representation for scalable vectors or what will be the semantics
> of shuffle / extract / insert operations.

If you want to know more, our dev meeting talk and slides will hopefully be available soon. If there will be a significant delay we can publish the slides ourselves for you to look at; those should be sufficient for you to understand enough of the details to form an opinion.

We also have a white paper on general SVE and vector-length agnostic programming available here: http://developer.arm.com/hpc/a-sneak-peek-into-sve-and-vla-programming

Thanks,
Amara

On 26 November 2016 at 17:07, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [...]
Renato Golin via llvm-dev
2016-Nov-27 15:34 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 27 November 2016 at 15:32, Amara Emerson <amara.emerson at gmail.com> wrote:
> http://developer.arm.com/hpc/a-sneak-peek-into-sve-and-vla-programming

Hi Amara,

Thanks! I've already dissected that one. :) It's probably the easiest way into SVE, for now.

cheers,
--renato
C Bergström via llvm-dev
2016-Nov-27 15:35 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
I'm sorry.. may I interrupt for a minute and try to grok things for a bit different angle.. While the VL can vary.. in practice wouldn't the cost of vectorization and width be tied more to the hardware implementation than anything else? The cost of vectorizing thread 1 vs 2 isn't likely to change? (Am I drunk and mistaken?) If the above holds true then the the length would be only variable between different hardware implementations.. (at least this is how I understand it) This seems tightly coupled to hardware.. On Sun, Nov 27, 2016 at 9:59 PM, Paul Walker via llvm-dev <llvm-dev at lists.llvm.org> wrote:> Thanks Renato, my takeaway is that I am presenting the design out of order. So let's focus purely on the vector length (VL) and ignore everything else. For SVE the vector length is unknown and can vary across an as yet undetermined boundary (process, library....). Within a boundary we propose making VL a constant with all instructions that operate on this constant locked within its boundary. > > I know this stretches the meaning of constant and my reasoning (however unsound) is below. We expect changes to VL to be infrequent and not located where it would present an unnecessary barrier to optimisation. With this in mind the initial implementation of VL barriers would be an intrinsic that prevents any instruction movement across it. > > Question: Is this type of intrinsic something LLVM supports today? > > Why a constant? Well it doesn't change within the context it is being used. More crucially the LLVM implementation of constants gives us a property that's very important to SVE (perhaps this is where prototyping laziness has kicked in). Constants remain attached to the instructions that operate on them through until code generation. This allows the semantic meaning of these instruction to be maintained, something non-scalable vectors get for free with their "real" constants. 
> As a specific example, take the vector reversal that LoopVectorize does when iterating backward through memory. For non-scalable vectors this looks like:
>
> shufflevector <4 x i32> %a, <4 x i32> undef, <i32 3, i32 2, i32 1, i32 0>
>
> Throughout the IR and into code generation the intention of this instruction is clear. Now turning to scalable vectors, the same operation becomes:
>
> shufflevector <n x 4 x i32> %a, <n x 4 x i32> undef, <n x 4 x i32> seriesvector (sub (i32 VL, 1), i32 -1)
>
> Firstly I'll highlight that the use of seriesvector is purely for brevity; let's ignore that debate for now. Our concern is that not treating VL as a Constant means sub and seriesvector are no longer constant and are likely to be hoisted away from the shufflevector. The knock-on effect is to force the code generator into generating generic vector permutes rather than utilising any specialised permute instructions the target provides.
>
> Does this make sense? I am not after agreement, I just want to make sure we are on the same page regarding our aims before digging down into how VL actually looks and its interaction with the loop vectoriser's chosen VF.
>
> Paul!!!
>
> p.s.
>
> I'll respond to the stepvector question later in a separate post to break down the different discussion points.
>
> On 26/11/2016, 17:07, "Renato Golin" <renato.golin at linaro.org> wrote:
>
> > On 26 November 2016 at 11:49, Paul Walker <Paul.Walker at arm.com> wrote:
> > > Related to this I want to push this and related conversations in a different direction. From the outset our approach to add SVE support to LLVM IR has been about solving the generic problem of vectorising for an unknown vector length and then extending this to support predication. With this in mind I would rather the problem and its solution be discussed at the IR's level of abstraction rather than getting into the guts of SVE.
> >
> > Hi Paul,
> >
> > How scalable vectors operate is intimately related to how you represent them in IR.
> > It took a long time for the vector types to be mapped to all available semantics. We still had to use a bunch of intrinsics for scatter / gather, and it took years to get strided access settled.
> >
> > I understand that scalable vectors are orthogonal to all this, but as a new concept, one that isn't available in any open source compiler I know of, it is one that will likely be very vague. Not publishing the specs only makes it worse.
> >
> > I take the example of the ACLE and ARMv8.2 patches that ARM has been pushing upstream. I have no idea what the new additions are, so I have to take your word that they're correct. But later on, different behaviour comes along for the same features with a comment "it didn't work that way, let's try this". Sometimes, I don't even know what failed, or why this new thing is better.
> >
> > When that behaviour is constricted to the ARM back-end, it's ok. It's a burden that Tim and I will have to carry, and so far, it has been a small burden. But exposing the guts of the vectorizers (which are already getting to a point where they need large refactorings), which will affect all targets, needs a bit more concrete information.
> >
> > The last thing we want is to keep changing how the vectorizer behaves every six months without any concrete information as to why.
> >
> > I also understand that LLVM is great at prototyping, and that's an important step for companies like ARM to make sure their features work as reliably as they expect in the wild, but I think adding new IR semantics and completely refactoring core LLVM passes without a clue is a few steps too far.
> >
> > I'm not asking for a full spec. All I'm asking for is a description of the intended basic functionality. Addressing modes, how to extract information from unknown lanes, or whether all reduction steps will be done like `saddv`.
> > Without that information, I cannot know what the best IR representation for scalable vectors is, or what the semantics of shuffle / extract / insert operations will be.
> >
> > > "complex constant" is the term used within the LangRef. Although its value can be different across certain interfaces this does not need to be modelled within the IR and thus for all intents and purposes we can safely consider it to be constant.
> >
> > From the LangRef:
> >
> > "Complex constants are a (potentially recursive) combination of simple constants and smaller complex constants."
> >
> > There's nothing there saying it doesn't need to be modelled in IR.
> >
> > > "vscale" is not trying to represent the result of such speculation. It's purely a constant runtime vector length multiplier. Such a value is required by LoopVectorize to update induction variables as described below, plus simple interactions like extracting the last element of a scalable vector.
> >
> > Right, I'm beginning to see what you mean...
> >
> > The vectorizer needs that to be a constant at compile time to make safety assurances.
> >
> > For instance: for (1..N) { a[i+3] = a[i] + i; }
> >
> > has a max VF of 3. If the vectorizer is to act on that loop, it'll have to change "vscale" to 3. If there are no loop dependencies, then you leave it as "vscale" but vectorize anyway.
> >
> > Other assurances are done for run-time constants, for instance, tail loops when changing
> >
> > for (i=0; i<N; i++) -> for (i=0; i<N; i+=VF)
> >
> > That VF is now a run-time "constant", and the vectorizer needs to see it as such, otherwise it can't even test for validity.
> >
> > So, the vectorizer will need to be taught two things:
> >
> > 1. "vscale" is a run-time constant, and for the purpose of validity, can be shrunk to any value down to two. If the value is shrunk, the new compile-time constant replaces vscale.
> >
> > 2. The cost model will *have* to treat "vscale" as an actual compile-time constant.
> > This could come from a target feature, overridden by a command line flag, but there has to be a default, which I'd assume is 4, given that it's the lowest length.
> >
> > > %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
> > >
> > > for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")
> >
> > I see what you mean.
> >
> > Quick question: since you're saying "vscale" is an unknown constant, why not just:
> >
> > %index.next = add nuw nsw i64 %index, i64 vscale
> >
> > All scalable operations will be tied up by the predication vector anyway, and you already know what the vector type size is. The only worry is about providing redundant information that could go stale and introduce bugs.
> >
> > I'm assuming the vectorizer will *have* to learn about the compulsory predication and build those vectors, or the back-end will have to handle them, and it can get ugly.
> >
> > > > %const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)
> > >
> > > This intrinsic matches the seriesvector instruction we originally proposed. However, on reflection we didn't like how it allowed multiple representations for the same constant.
> >
> > Can you expand on how this allows multiple representations for the same constant?
> >
> > This is a series, with a start and a step, and will only be identical to another which has the same start and step.
> >
> > Just like C constants can "appear" different...
> >
> > const int foo = 4;
> > const int bar = foo;
> > const int baz = 2 + 2;
> >
> > > I know this doesn't preclude the use of an intrinsic, I just wanted to highlight that doing so doesn't automatically change the surrounding IR.
> >
> > I don't mind IR changes, I'm just trying to understand the need for them.
> >
> > Normally, what we did in the past for some things was to add intrinsics and then, if it's clear a native IR construct would be better, we change it.
> > At least the intrinsic can be easily added without breaking compatibility with anything, and since we're in the prototyping phase anyway, changing the IR would be the worst idea.
> >
> > cheers,
> > --renato
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
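The legality argument in the quoted exchange — a loop with dependence distance 3 capping the VF at 3, plus the strip-mined loop with a scalar tail — can be sketched as a plain simulation. This is an illustrative Python sketch, not compiler output; `N`, the initial array contents, and the helper names are made up:

```python
N = 20

def scalar(a):
    # Reference semantics: a[i+3] = a[i] + i, a loop-carried
    # dependence of distance 3.
    for i in range(N - 3):
        a[i + 3] = a[i] + i

def simulated_vector(a, vf):
    # Vector semantics: all lanes read their inputs before any lane
    # writes, as a vector load followed by a vector store would.
    i = 0
    while i + 3 + vf - 1 < N:
        tmp = [a[i + l] + (i + l) for l in range(vf)]  # vector load + add
        for l in range(vf):                            # vector store
            a[i + 3 + l] = tmp[l]
        i += vf
    while i + 3 < N:                                   # scalar tail loop
        a[i + 3] = a[i] + i
        i += 1

ref = [x * x for x in range(N)]; scalar(ref)
v3 = [x * x for x in range(N)]; simulated_vector(v3, 3)
v4 = [x * x for x in range(N)]; simulated_vector(v4, 4)
print(v3 == ref, v4 == ref)  # → True False
```

With VF = 3 the vector reads of each chunk are exactly the writes of the previous chunk, so the result matches the scalar loop; with VF = 4 a lane reads a value the scalar loop would already have overwritten, which is why the vectorizer must be able to shrink "vscale" to the dependence distance.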
Renato Golin via llvm-dev
2016-Nov-27 15:42 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 27 November 2016 at 13:59, Paul Walker <Paul.Walker at arm.com> wrote:
> Thanks Renato, my takeaway is that I am presenting the design out of order. So let's focus purely on the vector length (VL) and ignore everything else. For SVE the vector length is unknown and can vary across an as yet undetermined boundary (process, library, ...). Within a boundary we propose making VL a constant, with all instructions that operate on this constant locked within its boundary.

This is in line with my current understanding of SVE. Check.

> I know this stretches the meaning of constant, and my reasoning (however unsound) is below. We expect changes to VL to be infrequent and not located where they would present an unnecessary barrier to optimisation. With this in mind, the initial implementation of VL barriers would be an intrinsic that prevents any instruction movement across it.
>
> Question: Is this type of intrinsic something LLVM supports today?

Function calls are natural barriers, but they should outline the parameters that cannot cross, especially if they're local, to make sure those don't cross it. In that sense, specially crafted intrinsics can get you the same behaviour, but it will be ugly. Also, we have special purpose barriers, e.g. @llvm.arm|aarch64.dmb, which could serve as a template for scalable-specific barriers.

> Why a constant? Well, it doesn't change within the context it is being used. More crucially, the LLVM implementation of constants gives us a property that's very important to SVE (perhaps this is where prototyping laziness has kicked in). Constants remain attached to the instructions that operate on them through until code generation. This allows the semantic meaning of these instructions to be maintained, something non-scalable vectors get for free with their "real" constants.

This makes sense. Not just because it behaves similarly, but because the back-end *must* guarantee it will be a constant within its boundaries and fail otherwise.
That's up to the SVE code generator to add enough SVE-specific instructions to get right.

> shufflevector <n x 4 x i32> %a, <n x 4 x i32> undef, <n x 4 x i32> seriesvector (sub (i32 VL, 1), i32 -1)
>
> Firstly I'll highlight that the use of seriesvector is purely for brevity; let's ignore that debate for now. Our concern is that not treating VL as a Constant means sub and seriesvector are no longer constant and are likely to be hoisted away from the shufflevector. The knock-on effect is to force the code generator into generating generic vector permutes rather than utilising any specialised permute instructions the target provides.

The concept looks ok. IIGIR, your argument is that an intrinsic will not look "constant enough" to the other IR passes, which can break the constantness required to generate the correct "constant" vector.

I'm also assuming SVE has an instruction that relates to the syntax above, which will reduce the setup process from N instructions to one and will be scale-independent. Otherwise, that whole exercise is meaningless. Something like:

mov x2, #i
const z0.b, p0/z, x2, 2   # From (i) to (2*VF)
const z1.b, p0/z, x2, -1  # From (i) to (i - VF) in reverse

The undefined behaviour that will come of such instructions needs to be understood in order not to break the IR. For example, if x2 is an unsigned variable and you iterate through the array but the array length is not a multiple of VF, the last range will pass through zero and become negative at the end. Or, if x2 is a 16-bit variable that must wrap (or saturate), the same tail issue happens as above.

> Does this make sense? I am not after agreement, I just want to make sure we are on the same page regarding our aims before digging down into how VL actually looks and its interaction with the loop vectoriser's chosen VF.

As much sense as is possible, I guess. But without knowing the guarantees we're aiming for, it'll be hard to know if any of those proposals will make proper sense.
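The tail hazard Renato describes — a reverse-iteration index vector passing through zero when the trip count is not a multiple of VF, with a wrapping unsigned index — is easy to show concretely. The following is a Python sketch with the 16-bit unsigned wrap modelled by masking; the function name and the values are made up for illustration:

```python
def reverse_lane_indices_u16(n, vf):
    """Lane indices produced by reverse iteration over n elements with
    vector factor vf, using a 16-bit unsigned index (wraps mod 2**16)."""
    chunks = []
    i = n - 1
    remaining = n
    while remaining > 0:
        # seriesvector(i, -1): lanes i, i-1, ..., i-vf+1, wrapped to u16.
        chunks.append([(i - l) & 0xFFFF for l in range(vf)])
        i = (i - vf) & 0xFFFF
        remaining -= vf
    return chunks

# n = 6 is not a multiple of vf = 4, so the last chunk runs past zero:
print(reverse_lane_indices_u16(6, 4))
# → [[5, 4, 3, 2], [1, 0, 65535, 65534]]
```

The wrapped lanes (65535, 65534) are exactly the out-of-range accesses that would need to be masked off by the predicate, which is the interaction between VL and predication that the thread is circling around.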
One way to make your "seriesvector" concept show up *before* any spec is out is to apply it to non-scalable vectors. Today, we have "zeroinitializer", which is very similar to what you want. You can even completely omit the "vscale" if we get the semantics right.

Hope that helps.

cheers,
--renato
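Renato's suggestion of exercising the concept on non-scalable vectors first is easy to see in miniature: for a fixed width, seriesvector(VL-1, -1) is just the constant reverse mask from Paul's earlier shufflevector example. A Python sketch (the helper names are made up, standing in for the IR semantics):

```python
def seriesvector(start, step, n):
    # seriesvector(start, step): start, start+step, start+2*step, ...
    return [start + step * i for i in range(n)]

def shufflevector(vec, mask):
    # Each result lane selects the input lane named by the mask.
    return [vec[m] for m in mask]

VL = 4  # a fixed width standing in for the runtime vector length

# For non-scalable vectors, seriesvector(VL-1, -1) is the constant
# reverse mask <3, 2, 1, 0> from the example up-thread.
mask = seriesvector(VL - 1, -1, VL)
print(mask)                            # → [3, 2, 1, 0]
print(shufflevector([10, 20, 30, 40], mask))  # → [40, 30, 20, 10]
```

For scalable vectors the same expression stays symbolic in VL, which is why hoisting the sub/seriesvector away from the shufflevector loses the "this is a reversal" idiom the code generator wants to match.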
Paul Walker via llvm-dev
2016-Nov-28 15:37 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
Hi Renato,

I'm just revisiting the seriesvector question you had a couple of days back.

> > %const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)
> >
> > This intrinsic matches the seriesvector instruction we originally proposed. However, on reflection we didn't like how it allowed multiple representations for the same constant.
>
> Can you expand on how this allows multiple representations for the same constant?
>
> This is a series, with a start and a step, and will only be identical to another which has the same start and step.
>
> Just like C constants can "appear" different...
>
> const int foo = 4;
> const int bar = foo;
> const int baz = 2 + 2;

With the original "seriesvector" proposal the sequence 0, 2, 4, 6, 8... can be built in multiple ways:

(1) seriesvector(i32 0, i32 2)
(2) mul(seriesvector(i32 0, i32 1), splat(2))

Both patterns are likely to occur and so must be matched within InstCombine and friends. The solution could be to first simplify seriesvector-related IR to a canonical form so only a single pattern requires matching. The "stepvector" proposal asks why we would bother introducing the problem in the first place, and instead forces a canonical representation from the outset. Obviously there will exist new idioms to canonicalise ((2*stepvector) + stepvector <==> 3*stepvector) but at least the most common cases are handled automatically.

We made this change when creating the upstream patch for "seriesvector" and observed its effect on the code base. When switching to "stepvector" the resulting patch was about 1/5 the size. That said, if a constant intrinsic is the way to go then I imagine said patch will be even smaller.

> One way to make your "seriesvector" concept show up *before* any spec
> is out is to apply it to non-scalable vectors.

That is my intention with the stepvector patch (https://reviews.llvm.org/D27105). You can see that the interface is common, but for non-scalable vectors the result is its equivalent ConstantVector.
Once an agreed form is available, LoopVectorize::getStepVector can be converted to become compatible with scalable vectors very easily, albeit as only a small step on a long road. Paul!!!
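Paul's point about multiple representations, and the canonicalisation idiom he mentions, can be checked with a small sketch. Python here stands in for IR-level constant folding; all helper names are made up:

```python
def stepvector(n):
    # The canonical form: the vector 0, 1, 2, ..., n-1.
    return list(range(n))

def splat(x, n):
    return [x] * n

def mul(a, b):
    return [x * y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

n = 8  # stands in for the runtime vector length

# Two ways to build 0, 2, 4, ... under the original seriesvector design:
series_0_2 = [0 + 2 * i for i in range(n)]    # seriesvector(0, 2)
via_step = mul(stepvector(n), splat(2, n))    # mul(stepvector, splat(2))
assert series_0_2 == via_step  # same constant, two spellings

# The canonicalisation idiom from the message:
# (2 * stepvector) + stepvector  <==>  3 * stepvector
lhs = add(mul(splat(2, n), stepvector(n)), stepvector(n))
rhs = mul(splat(3, n), stepvector(n))
assert lhs == rhs
```

The first pair of asserts is the duplication problem seriesvector introduces (two IR patterns for one constant); the second is the residual folding stepvector still needs, which is what makes the stepvector form the smaller patch.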
Renato Golin via llvm-dev
2016-Nov-28 16:38 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 28 November 2016 at 15:37, Paul Walker <Paul.Walker at arm.com> wrote:
> (1) seriesvector(i32 0, i32 2)
> (2) mul(seriesvector(i32 0, i32 1), splat(2))
>
> Both patterns are likely to occur and so must be matched within InstCombine and friends. The solution could be to first simplify seriesvector-related IR to a canonical form so only a single pattern requires matching. The "stepvector" proposal asks why we would bother introducing the problem in the first place, and instead forces a canonical representation from the outset.
>
> Obviously there will exist new idioms to canonicalise ((2*stepvector) + stepvector <==> 3*stepvector) but at least the most common cases are handled automatically.

I guess this is true for all constants, and it falls to things like constant folding to bring them to a canonical representation.

> We made this change when creating the upstream patch for "seriesvector" and observed its effect on the code base. When switching to "stepvector" the resulting patch was about 1/5 the size. That said, if a constant intrinsic is the way to go then I imagine said patch will be even smaller.

If we use an intrinsic, we'll need to add this special case to the vectoriser and maybe other passes, to treat it as a constant, but it will be an addition, not a replacement. If we create a new constant type, we'll need to change *a lot* of existing tests (provided we want to use it across the board), which will be a more complex change overall.

I'd leave that for later, once we have settled the semantics between SVE and RISC-V, and just have an intrinsic for now. Once the RISC-V implementation comes, we can compare it to SVE, and if it's identical, it'll be easier to implement as an intrinsic (for quick enablement); then later on, it'll be a no-brainer refactoring.

But this is just my personal (and reasonably uninformed) opinion. I'll let other people broaden the discussion a bit. :)

cheers,
--renato
Chandler Carruth via llvm-dev
2016-Nov-29 21:33 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On Mon, Nov 28, 2016 at 7:37 AM Paul Walker via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> That is my intention with the stepvector patch (https://reviews.llvm.org/D27105). You can see that the interface is common, but for non-scalable vectors the result is its equivalent ConstantVector. Once an agreed form is available, LoopVectorize::getStepVector can be converted to become compatible with scalable vectors very easily, albeit as only a small step on a long road.

Ok, I'm still catching up on this thread, but I think starting to review patches is going to make it substantially harder to have a productive conversation. We haven't yet really gotten to any kind of consensus around the design, and until then I think it would be very helpful to keep discussion focused on the high-level threads on llvm-dev rather than fragmenting it into the commits-list threads for the patches. I'm happy to have patches as FYI examples, but they shouldn't be the focus of the discussion.

Also, up the thread and even in this email there is significant talk about a substantial change of design based on feedback from the dev meeting. But there are over 40 emails already here and I've not found an actual concise high-level overview of the *new* design. I suspect it would be helpful to start a fresh RFC thread with a *concise* description of the new design so that folks can skip ahead and more productively join that discussion.