Hi Luke,

> was it intentional to leave out both jacob and myself?
> [...]
> if that was a misunderstanding or an oversight i apologise for raising it.

It was definitely not my intention to be non-inclusive, my apologies if that seemed the case!

> can i therefore recommend a change, here:
> [...]
> "This patch adds vscale as a symbolic constant to the IR, similar to
> undef and zeroinitializer, so that vscale - representing the
> runtime-detected "element processing" capacity - can be used in
> constant expressions"

Thanks for the suggestion! I like the use of the word `capacity`, especially now that the term 'vector length' has overloaded meanings. I'll add some extra words to the vscale patch to clarify its meaning.

> my only concern would be: some circumstances (some algorithms) may
> perform better with MMX, some with SSE, some with different levels of
> performance on e.g. AMD or Intel, which would, with benchmarking, show
> that some algorithms perform better if vscale=8 (resulting in some
> other MMX/SSE subset being utilised) than if vscale=16.

If fixed-width/short vectors are more beneficial for some algorithm, I'd recommend using fixed-width vectors directly. It would be up to the target to lower that to the vector instruction set. For AArch64, this can be done using Neon (max 128 bits) or with SVE/SVE2 using a 'fixed-width' predicate mask, e.g. vl4 for a predicate of 4 elements, even when the vector capacity is larger than 4.

> would it be reasonable to assume that predication *always* is to be
> used in combination with vscale? or is it the intention to
> [eventually] be able to auto-generate the kinds of [painful in
> retrospect] SIMD assembly shown in the above article?

When the size of a vector is constant throughout the program, but unknown at compile-time, then some form of masking would be required for loads and stores (or other instructions that may cause an exception).
So it is reasonable to assume that predication is used for such vectors.

>> This model would be complementary to `vscale`, as it still requires the
>> same scalable vector type to describe a vector of unknown size.
>
> ah. that's where the assumption breaks down, because of SV allowing
> its vectors to "sit" on top of the *actual* scalar regfile(s), we do
> in fact permit an [immediate-specified] vscale to be set, arbitrarily,
> at any time.

Maybe I'm missing something here, but if SV uses an immediate to define vscale, that implies the value of vscale is known at compile-time and thus regular (fixed-width) vector types can be used?

> now, we mmmiiiight be able to get away with assuming that vscale is
> equal to the absolute maximum possible setting (64 for RV64, 32 for
> RV32), then use / play-with the "runtime active VL get/set"
> intrinsics.
>
> i'm kiinda wary of saying "absolutely yes that's the way forward" for
> us, particularly without some input from Jacob here.

Note that there isn't a requirement to use `vscale` as proposed in my first patch. If RV only cares about the runtime active-VL then some explicit, separate mechanism to get/set the active VL would be needed anyway. I imagine the resulting runtime value (instead of `vscale`) to then be used in loop indvar updates, address computations, etc.

> ok, a link to that would be handy... let me see if i can find it...
> what comes up is this: https://reviews.llvm.org/D57504 is that right?

Yes, that's the one!

Thanks,
Sander

> On 1 Oct 2019, at 14:42, Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
>
> (readers note this, copied from the end before writing!
> "Given that (2) is a very different use-case, I hope we can keep discussions on
> that model separate from this thread, if possible.")
>
>
> On Tue, Oct 1, 2019 at 12:45 PM Sander De Smalen
> <Sander.DeSmalen at arm.com> wrote:
>
>> Thanks @Robin and @Graham for giving some background on scalable vectors and clarifying some of the details!
>
> hi sander, thanks for chipping in. um, just a point of order: was it
> intentional to leave out both jacob and myself? my understanding is
> that inclusive and welcoming language is supposed to be used within this
> community, and it *might* be mistaken as being exclusionary and
> unwelcoming.
>
> if that was a misunderstanding or an oversight i apologise for raising it.
>
>> Apologies if I'm repeating things here, but it is probably good to emphasize
>> the conceptually different, but complementary models for scalable vectors:
>> 1. Vectors of unknown, but constant size throughout the program.
>
> ... which matches with both hardware-fixed per-implementation
> variations in potential [max] SIMD-width for any given architecture as
> well as Vector-based "Maximum Vector Length", typically representing
> the "Lanes" of a [traditional] Vector Architecture.
>
>> 2. Vectors of changing size throughout the program.
>
> ... representing VL in "Cray-style" Vector Engines (NEC SX-Aurora, RVV,
> SV) and representing the (rather unfortunate) corner-case cleanup -
> and predication - deployed in SIMD
> (https://www.sigarch.org/simd-instructions-considered-harmful/)
>
>> Where (2) basically builds on (1).
>>
>> LLVM's scalable vectors support (1) directly. The scalable type is defined
>> using the concept `vscale` that is constant throughout the program and
>> expresses the unknown, but maximum size of a scalable vector.
>> My patch builds on that definition by adding `vscale` as a keyword that
>> can be used in expressions.
>
> ah HA! excccellent. *that* was the sentence giving the key piece of
> information needed to understand what is going on, here. i appreciate
> it does actually say that, "This patch adds vscale as a symbolic
> constant to the IR, similar to undef and zeroinitializer, so that it
> can be used in constant expressions", however without the context
> about what vscale is based *on*, it's just not possible to understand.
>
> can i therefore recommend a change, here:
>
> "Scalable vector types are defined as <vscale x #elts x #eltty>,
> where vscale itself is defined as a positive symbolic constant
> of type integer, representing a platform-dependent (fixed but
> implementor-specific) limit of any given hardware's maximum
> simultaneous "element processing" capacity"
>
> you could add, in brackets, "(typically the SIMD element width)" at
> the end there. then, this starts to make sense, but could be further
> made explicit:
>
> "This patch adds vscale as a symbolic constant to the IR, similar to
> undef and zeroinitializer, so that vscale - representing the
> runtime-detected "element processing" capacity - can be used in
> constant expressions"
>
>> For this model, predication can be used to disable the lanes
>> that are not needed. Given that `vscale` is defined as inherently
>> constant and a corner-stone of the scalable type, it makes no
>> sense to describe the `vscale` keyword as an intrinsic.
>
> indeed: if it's intended near-exclusively for SIMD-style hardware,
> then yes, absolutely.
>
> my only concern would be: some circumstances (some algorithms) may
> perform better with MMX, some with SSE, some with different levels of
> performance on e.g. AMD or Intel, which would, with benchmarking, show
> that some algorithms perform better if vscale=8 (resulting in some
> other MMX/SSE subset being utilised) than if vscale=16.
>
> in particular, on hardware which doesn't *have* predication, they're
> definitely in trouble if vscale is fixed (SIMD considered harmful).
> it may even be the case, for whatever reason, that performance sucks
> for AVX512 instructions with a low predicate bitcount, if compared to
> using smaller-range SIMD operations, perhaps due to the vastly-greater
> size of the AVX instructions themselves.
>
> honestly i don't know: i'm just throwing ideas out, here.
>
> would it be reasonable to assume that predication *always* is to be
> used in combination with vscale? or is it the intention to
> [eventually] be able to auto-generate the kinds of [painful in
> retrospect] SIMD assembly shown in the above article?
>
>> The other model for scalable vectors (2) requires additional intrinsics
>> to get/set the `active VL` at runtime.
>
> ok. with you here.
>
>> This model would be complementary to `vscale`, as it still requires the
>> same scalable vector type to describe a vector of unknown size.
>
> ah. that's where the assumption breaks down, because of SV allowing
> its vectors to "sit" on top of the *actual* scalar regfile(s), we do
> in fact permit an [immediate-specified] vscale to be set, arbitrarily,
> at any time.
>
> now, we mmmiiiight be able to get away with assuming that vscale is
> equal to the absolute maximum possible setting (64 for RV64, 32 for
> RV32), then use / play-with the "runtime active VL get/set"
> intrinsics.
>
> i'm kiinda wary of saying "absolutely yes that's the way forward" for
> us, particularly without some input from Jacob here.
>
>> `vscale` can be used to express the maximum vector length,
>
> wait... hang on: RVV i am pretty certain there is not supposed to be
> any kind of assumption of knowledge about MVL. in SV that's fine, but
> in RVV i don't believe it is.
>
> bruce, andrew, robin, can you comment here?
>
>> but the `active vector length` would need to be handled through
>> explicit intrinsics. As Robin explained, it would also need Simon Moll's
>> vector predication proposal to express operations on `active VL` elements.
>
> ok, a link to that would be handy... let me see if i can find it...
> what comes up is this: https://reviews.llvm.org/D57504 is that right?
>
>>> apologies for asking: these are precisely the kinds of
>>> from-zero-prior-knowledge questions that help with any review process
>>> to clarify things for other users/devs.
>
>> No apologies required, the discussion on scalable types has been going on
>> for quite a while so there are many email threads to read through. It is
>> important these concepts are clear and well understood!
>
> :)
>
>>> clarifying this in the documentation strings on vscale, perhaps even
>>> providing c-style examples, would be extremely useful, and avoid
>>> misunderstandings.
>
>> I wonder if we should add a separate document about scalable vectors
>> that describes these concepts in more detail with some examples.
>
> it's exceptionally complex, with so many variants, i feel this is
> almost essential.
>
>> Given that (2) is a very different use-case, I hope we can keep discussions on
>> that model separate from this thread, if possible.
>
> good idea, if there's a new thread started please do cc me.
> cross-relationship between (2) and vscale may make it slightly
> unavoidable though to involve this one.
>
> l.
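The model-(1) behaviour discussed in this message - a vector length that is fixed at runtime but unknown at compile time, with predication masking off the excess lanes - can be sketched in plain C. Everything here (the `vscale` value, the function names) is illustrative only, not LLVM IR or SVE code:

```c
#include <stddef.h>

/* Hypothetical runtime value of vscale: constant for the whole program,
 * but not known when the code was compiled.  On real hardware this would
 * come from the implementation; here it is just a parameter of the sketch. */
static size_t vscale = 4;

/* Process n elements with <vscale x 4 x i32>-style vectors.  The vector
 * length (vscale * 4) never changes; a per-lane predicate bit disables
 * the lanes that would run past the end of the array. */
static void add_one(int *a, size_t n) {
    size_t vlen = vscale * 4;            /* elements per scalable vector */
    for (size_t i = 0; i < n; i += vlen) {
        for (size_t lane = 0; lane < vlen; lane++) {
            int active = (i + lane) < n; /* the predicate bit for this lane */
            if (active)
                a[i + lane] += 1;        /* masked-off lanes do nothing */
        }
    }
}
```

With `vscale = 4` the sketch processes 16 elements per "vector" pass; for `n = 10` the predicate disables the last 6 lanes, which is exactly the masked-load/store requirement Sander describes.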
Luke Kenneth Casson Leighton via llvm-dev
2019-Oct-02 03:09 UTC
[llvm-dev] Adding support for vscale
On Wednesday, October 2, 2019, Sander De Smalen <Sander.DeSmalen at arm.com> wrote:

> It was definitely not my intention to be non-inclusive, my apologies if
> that seemed the case!

No problem Sander.

> > can i therefore recommend a change, here:
> > [...]
> > "This patch adds vscale as a symbolic constant to the IR, similar to
> > undef and zeroinitializer, so that vscale - representing the
> > runtime-detected "element processing" capacity - can be used in
> > constant expressions"
>
> Thanks for the suggestion! I like the use of the word `capacity`
> especially now that the term 'vector length' has overloaded meanings.
> I'll add some extra words to the vscale patch to clarify its meaning.

super. will keep an eye out for it.

> > my only concern would be: some circumstances (some algorithms) may
> > perform better with MMX, some with SSE, some with different levels of
> > performance on e.g. AMD or Intel, which would, with benchmarking, show
> > that some algorithms perform better if vscale=8 (resulting in some
> > other MMX/SSE subset being utilised) than if vscale=16.
>
> If fixed-width/short vectors are more beneficial for some algorithm, I'd
> recommend using fixed-width vectors directly. It would be up to the target
> to lower that to the vector instruction set. For AArch64, this can be done
> using Neon (max 128 bits) or with SVE/SVE2 using a 'fixed-width' predicate
> mask, e.g. vl4 for a predicate of 4 elements, even when the vector capacity
> is larger than 4.

I have a feeling that this was - is - the "workaround" that Graham was referring to.

> > would it be reasonable to assume that predication *always* is to be
> > used in combination with vscale? or is it the intention to
> > [eventually] be able to auto-generate the kinds of [painful in
> > retrospect] SIMD assembly shown in the above article?
>
> When the size of a vector is constant throughout the program, but unknown
> at compile-time, then some form of masking would be required for loads and
> stores (or other instructions that may cause an exception). So it is
> reasonable to assume that predication is used for such vectors.
>
> >> This model would be complementary to `vscale`, as it still requires the
> >> same scalable vector type to describe a vector of unknown size.
> >
> > ah. that's where the assumption breaks down, because of SV allowing
> > its vectors to "sit" on top of the *actual* scalar regfile(s), we do
> > in fact permit an [immediate-specified] vscale to be set, arbitrarily,
> > at any time.
>
> Maybe I'm missing something here, but if SV uses an immediate to define
> vscale, that implies the value of vscale is known at compile-time and thus
> regular (fixed-width) vector types can be used?

It's not really intended to be exposed to frontends except by #pragma or
inline assembly. We *can* set an immediate, however by doing so we hard-code
the allocated maximum number of scalar regs to be utilised. If that is too
many then register spill might occur (with disastrous penalties for 3D), and
if too small then performance is poor as ALUs sit idle.

In addition, SV works on both RV32 and RV64, where the RV32 regfiles have
half the total number of bits, and consequently we really will need dynamic
scaling there in order to halve the size of vectors rather than risk
register spill.

Plus, if people reeeeeaaally want to not have 128 registers - there may be
a genuine market, particularly in 3D Embedded, that considers the cost of
128 regs to be too great - they can use the "normal" 32 of RISC-V instead.
Here they would definitely want vscale=1 and to do everything as close to
scalar operation as possible. If they have vec4 datatypes (using SUBVL)
they might end up with regspill, but that is a price they pay for the
decision to reduce the regfile size.
(btw SUBVL is a multiplier of length 2, 3 or 4, representing vec2-4,
identical to RVV's subvector. This is explicitly used in the (c/c++) source
code, where MVL immediates and VL lengths definitely are not.)

> > now, we mmmiiiight be able to get away with assuming that vscale is
> > equal to the absolute maximum possible setting (64 for RV64, 32 for
> > RV32), then use / play-with the "runtime active VL get/set"
> > intrinsics.
> >
> > i'm kiinda wary of saying "absolutely yes that's the way forward" for
> > us, particularly without some input from Jacob here.
>
> Note that there isn't a requirement to use `vscale` as proposed in my
> first patch.

Oh? Ah! That is an important detail :) One that is tough to express in a
short introduction in the docstring without going into too much detail.

> If RV only cares about the runtime active-VL then some explicit, separate
> mechanism to get/set the active VL would be needed anyway. I imagine the
> resulting runtime value (instead of `vscale`) to then be used in loop
> indvar updates, address computations, etc.

Ok, this might be the GetOutOfJailFree card I was looking for :)

My general feeling on this then is that both RVV and SV should avoid using
vscale.

In the case of RVV, MVL is a hardware-defined constant that is never
*intended* to be known by applications. There's no published detection
mechanism. Loops are supposed to be designed to run a few more times on
lower-spec'd hardware.

Robin, what's your thoughts there?

For SV, it looks like we will need to do something like <%reg x 4 x f32>,
with an analysis pass to process it, calculating the total number of
available regs for a given block, isolated by LD and ST boundaries, and
maximising %reg so as not to spill.

> > ok, a link to that would be handy... let me see if i can find it...
> > what comes up is this: https://reviews.llvm.org/D57504 is that right?
>
> Yes, that's the one!

Super, encountered it a few months back, will read again.

L.
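Sander's suggestion above - an explicit runtime active-VL whose value feeds the loop induction-variable update and address computation - can be sketched as follows. The `VLMAX` value and function names are hypothetical, modelling vsetvl-style behaviour rather than any real intrinsic:

```c
#include <stddef.h>

#define VLMAX 8  /* hypothetical hardware maximum; the program never hard-codes this */

/* Toy model of a "set active VL" request: the hardware grants
 * min(requested, VLMAX) and the program works with whatever it got. */
static size_t set_vl(size_t requested) {
    return requested < VLMAX ? requested : VLMAX;
}

/* Model (2): the active VL can change per iteration, and the induction
 * variable advances by the granted VL.  No tail predication is needed -
 * the final iteration simply runs with a shorter VL. */
static long sum(const int *a, size_t n) {
    long total = 0;
    size_t i = 0;
    while (i < n) {
        size_t vl = set_vl(n - i);     /* runtime active VL for this pass */
        for (size_t lane = 0; lane < vl; lane++)
            total += a[i + lane];
        i += vl;                       /* indvar update uses the runtime VL */
    }
    return total;
}
```

For `n = 10` with `VLMAX = 8`, the loop runs once with `vl = 8` and once with `vl = 2`: exactly the "run a few more times on lower-spec'd hardware" behaviour described above, with no knowledge of VLMAX baked into the program.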
--
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
On Wed, 2 Oct 2019 at 05:09, Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:

> My general feeling on this then is that both RVV and SV should avoid using
> vscale.
>
> In the case of RVV, MVL is a hardware defined constant that is never
> *intended* to be known by applications. There's no published detection
> mechanism. Loops are supposed to be designed to run a few more times on
> lower spec'd hardware.
>
> Robin, what's your thoughts there?

Software should be portable across different RVV implementations, in
particular across different values of the impl-defined constants VLEN,
ELEN, SLEN. But being portable does not mean software must never mention
these (and derived quantities such as vscale or, in the RVV spec, VLMAX)
at all, just that it has to work correctly no matter which value they have.

And in fact, there is a published (written out in the spec) mechanism for
obtaining VLMAX, which is directly related to VLEN (so you can obtain VLEN
with a little more arithmetic, though for most purposes VLMAX is more
useful): requesting a vector length of -1 (unsigned: 2^XLEN - 1) is
guaranteed to result in vl=VLMAX.

For regular strip-mined loops, the vsetvl instruction takes care of
everything so there's simply no need for the program to do this. But for
other tasks, it's required (i.e., you can't sensibly write the program
otherwise) and perfectly fine w.r.t. portability. One example is the stack
frame layout when there are any vectors on the stack (e.g. for spills),
since the vector stack slots must in general be large enough to hold a
full vector (= VLEN*LMUL bits).

Granted, I don't think this or other examples will normally occur in LLVM
IR generated by a loop vectorizer, so vscale will probably not occur very
frequently in RVV. Nevertheless, there is nothing inherently non-portable
about it.
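Robin's probe - request a vector length of 2^XLEN - 1 and be granted VLMAX - can be modelled in a few lines of C. This deliberately ignores the subtler AVL-granting rules in some spec drafts; the `VLMAX` value and the function names are purely illustrative:

```c
#include <stdint.h>

#define VLMAX 16  /* impl-defined, derived from VLEN and LMUL; unknown to the program */

/* Toy model of the vsetvl rule Robin describes: the granted vl is
 * capped at VLMAX, so asking for the largest representable application
 * vector length is guaranteed to return VLMAX itself. */
static uint64_t vsetvl_model(uint64_t avl) {
    return avl < VLMAX ? avl : VLMAX;
}

/* The published detection mechanism: request vl = -1 (as unsigned,
 * 2^XLEN - 1) and read back the granted value. */
static uint64_t probe_vlmax(void) {
    return vsetvl_model(UINT64_MAX);
}
```

A program that needs VLEN (e.g. for the stack-slot sizing Robin mentions) would call the probe once and derive VLEN from the result with a little arithmetic.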
Regards
Robin

PS: I don't want to read too much into your repeated use of "MVL", but
FWIW the design of RVV has changed quite radically since "MVL" was last
used in any spec draft. If you haven't read any version since v0.6
(~ December 2018) with a "clean slate", may I suggest you do that when you
find the time? You can find the latest draft at
https://github.com/riscv/riscv-v-spec/