thr3ads.net - llvm dev - [llvm-dev] Questions about vscale [Apr 2020]

Hanna Kruppe via llvm-dev

2020-Apr-13 17:03 UTC

[llvm-dev] Questions about vscale

On Tue, 7 Apr 2020 at 16:09, Renato Golin <rengolin at gmail.com> wrote:
>
> On Tue, 7 Apr 2020 at 12:51, Hanna Kruppe <hanna.kruppe at gmail.com> wrote:
> > > 1. is LMUL always a multiple of ELEN?
> > This happens to be true (at least in the current spec, disregarding
> > some in-progress proposals) just because both are powers of two and
> > the largest possible LMUL equals the smallest possible ELEN (8), but I
> > don't think there is any meaning to be found in this observation. The
> > two values govern unrelated aspects of the vector unit.
>
> Sorry, I meant multiple of basic types. But you have answered my question. :)
>
> > > 2. Is this fixed on the hardware, depending on the actual lengths, or
> > > is this dynamically set by software (on a register or status flag)?
> > > 2a. If dynamic, can it change from program to program? Function to function?
> > It's not clear whether by "this" you mean ELEN, LMUL, or something
> > else. ELEN is fixed in hardware. LMUL is a property of each individual
> > instruction.
>
> Sorry again, "this" as in both ELEN and LMUL and their relationship. Ack.
>
> > I don't know what "vscale wouldn't apply" is supposed to mean.
>
> Legalisation-wise, you got right, like <n x 0.5 x i64> is invalid and
> gets converted to <n x 1 x i32>, which it is.
>
> "Wouldn't apply" as in "what would be the point of having half-scale
> on a type that needs to be broken in half", and thus making it whole.
> You explain better below, so ignore it for now.
>
> > But how? If we take Kai's table as gospel and look at a VLEN = ELEN =
> > 32 machine, the vector type <vscale x 2 x i32> is supposed to map to a
> > single vector register, which is 32b small, and thus <vscale x 2 x
> > i32> would have just one element in this context (matching the "vscale
> > = 1/2" intuition). To be consistent with this, <vscale x 1 x i32>
> > would have to contain just *half* an element. This is not something
> > any legalization strategy can achieve, because it is a fundamentally
> > impossible notion. So we end up in a situation where some types are
> > not just illegal and have to be legalized, but are contradictory and
> > can't be legalized in any meaningful way.
>
> Right, we have faced that problem before on non-scalable vector extensions.
>
> For example, vectorising 3 operations in a 4-wide vector and adding an
> undef in the last lane.
>
> It didn't use to be possible to do that, many years ago, as a general
> case. But if you look at register aliasing (VFP and NEON in ARMv7), we
> had the idea of different number of elements on the same register,
> depending on how you look.
>
> I'm not proposing to create all combinations of half-vscale shadowing,
> but perhaps adding half-length types as valid and lowering them in a
> special way could work much simpler than changing the interpretation
> of vscale.
[re-sending because I dropped the list -- sorry for the extra copy, Renato!]

I don't see how the situation you mention is comparable. Legalization
for e.g. <3 x i32> was not implemented at first, but as demonstrated
by the fact that it *was* implemented later, there's no conceptual
problem with legalizing that kind of type. You don't even have to
legalize them in vector registers, three scalar registers work fine
(you can even do that on the IR level).
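
As a rough illustration of the two legalizations mentioned above, here is a minimal Python sketch (not LLVM code). It models vectors as plain lists and assumes, purely for the example, that four lanes is the only "legal" vector width. The helper names add_v4i32, add_v3i32_scalarized and add_v3i32_widened are made up for this illustration.

```python
MASK32 = 0xFFFFFFFF  # i32 arithmetic wraps modulo 2**32


def add_i32(x, y):
    """Scalar i32 add with wrap-around."""
    return (x + y) & MASK32


def add_v4i32(a, b):
    """Stand-in for a native 4-lane vector add (the assumed legal width)."""
    assert len(a) == 4 and len(b) == 4
    return [add_i32(x, y) for x, y in zip(a, b)]


def add_v3i32_scalarized(a, b):
    """Legalize a <3 x i32> add as three independent scalar adds."""
    a0, a1, a2 = a
    b0, b1, b2 = b
    return [add_i32(a0, b0), add_i32(a1, b1), add_i32(a2, b2)]


def add_v3i32_widened(a, b, pad=0):
    """Legalize a <3 x i32> add by widening to 4 lanes.

    The extra lane holds a don't-care value (undef in IR terms) and is
    dropped again after the operation.
    """
    wide = add_v4i32(list(a) + [pad], list(b) + [pad])
    return wide[:3]
```

Both routes give the same three results, which is the sense in which <3 x i32> has no conceptual legalization problem: the type always has exactly three elements, whatever registers end up holding them.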

For <vscale x 1 x i32> with a fractional value of vscale, there are
several conceivable ways to "legalize" this type, but none of them
work. Legalization (codegen in general) does not know if the machine
code will eventually run on a chip with vector registers so small that
vscale works out to 1/2, but it has to choose some legalization
strategy. I can imagine several approaches to this, but since the
actual value of vscale is not known at this time, it will have to map
the illegal scalable vector types to the vector registers in some way,
to ensure there's enough space even when vscale is very large in some
executions of the program.

Depending on how you do that exactly, the generated code might have
different behavior when running on a vscale == 1/2 machine, e.g. you
might end up with a vector register holding *one* i32 element or a
vector register holding *zero* i32 elements (i.e., the sole lane of
the 32-bit vector register is masked out). There might be other
approaches that result in yet another behavior, such as a hardware
fault, but crashes and other immediate problems aside, you're going to
end up with a certain discrete number of i32 values. That's a problem.
If <vscale x 1 x i32> ends up having one element, and <vscale x 2 x
i32> also has one (= 2 * 0.5) element, then that's wrong: the latter
type must have twice as many elements as the former (one example where
this matters: split_low / split_high / concat shuffle patterns). The
second option, a vector with *zero* elements, is just as wrong if not
worse.
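
The counting argument can be checked mechanically. The following is a small sketch, assuming the element count of <vscale x n x i32> is exactly vscale * n. The helpers element_count and try_strategy are hypothetical, and rounding up or down stands in for the "one element" and "zero elements" outcomes described above.

```python
import math
from fractions import Fraction


def element_count(vscale, n):
    """Exact element count of <vscale x n x T>, i.e. vscale * n."""
    return Fraction(vscale) * n


def try_strategy(round_fn, vscale):
    """Round both types to a whole number of elements and test the invariant
    that <vscale x 2 x T> has twice as many elements as <vscale x 1 x T>."""
    one = round_fn(element_count(vscale, 1))
    two = round_fn(element_count(vscale, 2))
    return one, two, two == 2 * one


half = Fraction(1, 2)
print(element_count(half, 1))          # exact count is fractional
print(try_strategy(math.ceil, half))   # round up: invariant broken
print(try_strategy(math.floor, half))  # round down: invariant broken
print(try_strategy(math.ceil, 2))      # integer vscale: invariant holds
```

With vscale == 1/2, neither rounding rule preserves the 2x relationship between the two types, while any integer vscale preserves it trivially.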

It's not that a correct legalization exists but it's too annoying to
implement, or that one might exist but I'm too lazy to work it out.
We're also not running into a limitation or oddity of the RISC-V vector
ISA in particular. It's simply that, if you set vscale == 0.5, then by
the way scalable vector types work (vscale * const elements), some
vector types that can be written in the IR would need to have a
fractional number of elements to be consistent with the other scalable
vector types. As that is not possible (not even conceptually),
whatever code you emit to try to legalize that type will end up being
wrong in some respect.

So if we'd decide to support fractional vscale, we can't say these
types are "illegal". In LLVM parlance, illegal types can be used in
LLVM IR and targets aspire to turn them into something that works
correctly, even if it's very inefficient. Sometimes a legalization is
unimplemented or buggy, but these problems can be patched and this has
often happened in the past. With fractional vscale, the situation is
quite different: nobody will ever be able to use certain scalable
vector types on the target in question, because they can't be
legalized even in principle.

In contrast, scalable vector types that are illegal because they're
too large (e.g. <vscale x 32 x i64>) can be legalized just fine. For
example, you could split them across a sufficiently large (fixed)
number of vector registers and maybe spill them to the stack for
inserts/extracts/shuffles/etc. that cross lanes or access elements at
data-dependent positions. Implementing this will probably not be a
priority for any targets, but it can be implemented whenever it does
become important to someone.
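
A minimal sketch of that splitting idea, under the assumption (not taken from any real target) that the largest legal i64 type is <vscale x 8 x i64>. The names split_scalable and locate are hypothetical. The point is that the number of parts is fixed at compile time, while the position of an element within the parts depends on the runtime vscale.

```python
def split_scalable(n, max_legal_n):
    """Split <vscale x n x T> into parts <vscale x k x T> with k <= max_legal_n.

    Returns the per-part multipliers k. The number of parts does not depend
    on the runtime value of vscale.
    """
    if n <= 0 or max_legal_n <= 0:
        raise ValueError("multipliers must be positive")
    parts = []
    remaining = n
    while remaining > 0:
        k = min(remaining, max_legal_n)
        parts.append(k)
        remaining -= k
    return parts


def locate(index, vscale, parts):
    """Map a runtime element index to (part number, offset within the part)
    for a given runtime (integer) vscale."""
    for part_no, k in enumerate(parts):
        size = vscale * k  # elements held by this part at runtime
        if index < size:
            return part_no, index
        index -= size
    raise IndexError("element index out of range")


parts = split_scalable(32, 8)
print(parts)                 # four parts of <vscale x 8 x i64>
print(locate(20, 2, parts))  # with vscale == 2, each part holds 16 elements
```

Data-dependent indices like the one passed to locate are the case where spilling to the stack may be simpler than selecting among registers.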

I hope this lengthy explanation helps you see where I'm coming from.

Thanks,
Hanna
> Also, I'm acting like devil's advocate, so don't take my comments as a
> rejection of your proposal, I'm just trying to understand where you
> are coming from.
>
> cheers,
> --renato

Renato Golin via llvm-dev

2020-Apr-13 18:04 UTC

[llvm-dev] Questions about vscale

On Mon, 13 Apr 2020 at 18:04, Hanna Kruppe <hanna.kruppe at gmail.com> wrote:
> I don't see how the situation you mention is comparable. Legalization
> for e.g. <3 x i32> was not implemented at first, but as demonstrated
> by the fact that it *was* implemented later, there's no conceptual
> problem with legalizing that kind of type. You don't even have to
> legalize them in vector registers, three scalar registers work fine
> (you can even do that on the IR level).
That was the point I was trying to make, but in my head that fused
with register shadowing, which derailed the point.

To be clear, yes, "invalid" register configurations can usually be
legalised easily, in multiple ways, at lowering.

Not all will be optimal, though, and that is where the problem lives.
> Legalization (codegen in general) does not know if the machine
> code will eventually run on a chip with vector registers so small that
> vscale works out to 1/2, but it has to choose some legalization
> strategy.
This is interesting, I had not realised that from the descriptions of
the problem so far. I thought it was just due to non-power-of-two
lengths.

A "vector" register that is smaller than 64 bits wouldn't make much
sense, unless this is a DSP-type extension on very small types. In
those cases, every clock cycle and every instruction counts,
especially inside the inner loop.

I'm struggling to see how this can be optimally executed from
generic scalable code, which usually profits from the fact that
vscale >> 1.
> If <vscale x 1 x i32> ends up having one element, and <vscale x 2 x
> i32> also has one (= 2 * 0.5) element, then that's wrong: the latter
> type must have twice as many elements as the former (one example where
> this matters: split_low / split_high / concat shuffle patterns). The
> second option, a vector with *zero* elements, is just as wrong if not
> worse.
Right, that was the idea behind vscale from the beginning. I don't
know how many elements either has, but I know the latter has twice as
many as the former.

I see why you would want half-length, because that truth still holds:
the latter has twice as many halves as the former.

But how do you handle the last half? Do you ignore? Do you load /
store half? Do you always mask it out? Do you fuse with the next
iterations' first half?

If the semantics is not clear on how the back-ends are supposed to use
that extra half, then extending the IR in such a way can make it very
hard for generic optimisations to understand anything about the
ranges, validity of operations, alignment, masks, undefined behaviour,
etc.
> It's not that a correct legalization exists but it's too annoying to
> implement, or that one might exist but I'm too lazy to work it out.
I never meant to imply that. Apologies if that's what came through.
> We're also not running in a limitation or oddity of the RISC-V vector
> ISA in particular. It's simply that, if you set vscale == 0.5, then by
> the way scalable vector types work (vscale * const elements), some
> vector types that can be written in the IR would need to have a
> fractional number of elements to be consistent with the other scalable
> vector types. As that is not possible (not even conceptually),
> whatever code you emit to try to legalize that type will end up being
> wrong in some respect.
Honestly, I'm running out of breath in this discussion. :)

I don't know a lot about SVE and even less about RISC-V, so I'll leave
the more in-depth technical discussions for Florian/Sander and others
to chime in.
> So if we'd decide to support fractional vscale, we can't say these
> types are "illegal". In LLVM parlance, illegal types can be used in
> LLVM IR and targets aspire to turn them into something that works
> correctly, even if it's very inefficient. Sometimes a legalization is
> unimplemented or buggy, but these problems can be patched and this has
> often happened in the past. With fractional vscale, the situation is
> quite different: nobody will ever be able to use certain scalable
> vector types on the target in question, because they can't be
> legalized even in principle.
I have not spent the time you guys have on this, but if I understood
your problem correctly, I too can't think of a way to represent this
in non-fractional ways.

I'm not saying this is a good idea, and I think you're not saying it
is either, but perhaps the only idea.

If that's the case, then I have proposed to use a different
flag/integer to mean half-scale instead of floating points, and
hopefully that can be transparent to the rest of scalable vector code.

But I'd really like to get other people's point of view, as I'm not
confident in my appraisal.
> I hope this lengthy explanation help you see where I'm coming from.
It did, thanks!

--renato

Kai Wang via llvm-dev

2020-Apr-13 23:12 UTC

[llvm-dev] Questions about vscale

Hi Hanna,

Thanks Hanna. I got your point.
You mean that if the type does not exist in the type system, we still need
to legalize it.
I support the following four kinds of i32 scalable vector types. I also do
not know how to reason about vscale x 1 x i32 under this type system.

          LMUL = 1            LMUL = 2            LMUL = 4            LMUL = 8
int32_t | vscale x  2 x i32 | vscale x  4 x i32 | vscale x  8 x i32 | vscale x 16 x i32

Could we just support the types in the table on the RISC-V target? I mean
do not legalize it, and just issue error messages for vscale x 1 x i32.

In my latest reply, I do not propose fractional vscale. I propose that “vscale x
n” be an integer. Under that assumption, I could not reason about vscale x 1
x i32. However, I could reason about vscale x 2 x i32 even when vscale = 1/2. We
only care about the part “vscale x n” being an integer.

The original problem is that the type system proposed by Hanna under ELEN = 64 is

          LMUL = 1            LMUL = 2            LMUL = 4            LMUL = 8
int32_t | vscale x  2 x i32 | vscale x  4 x i32 | vscale x  8 x i32 | vscale x 16 x i32

Under ELEN = 32 is

          LMUL = 1            LMUL = 2            LMUL = 4            LMUL = 8
int32_t | vscale x  1 x i32 | vscale x  2 x i32 | vscale x  4 x i32 | vscale x  8 x i32
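
Both tables are consistent with the rule n = LMUL * ELEN / SEW, where SEW is the element width in bits. That rule is inferred here from the rows above rather than quoted from the proposal, so treat the sketch below as an assumption. The function names scalable_type and type_row are hypothetical.

```python
def scalable_type(sew, lmul, elen):
    """Return the IR type for elements of width `sew` bits at a given LMUL,
    assuming the multiplier is n = LMUL * ELEN / SEW (inferred rule)."""
    n, rem = divmod(lmul * elen, sew)
    if rem != 0 or n == 0:
        raise ValueError("no whole-number multiplier for this combination")
    return f"<vscale x {n} x i{sew}>"


def type_row(sew, elen, lmuls=(1, 2, 4, 8)):
    """One table row: the types for LMUL = 1, 2, 4, 8 under a given ELEN."""
    return [scalable_type(sew, lmul, elen) for lmul in lmuls]


for elen in (64, 32):
    print(f"ELEN = {elen}:", " | ".join(type_row(32, elen)))
```

The same source-level type (int32_t at LMUL = 1) maps to a different IR type depending on ELEN, which is the incompatibility described here.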

The problem is that there are multiple type systems for the RISC-V RVV
implementation, and they are not compatible across different ELEN
configurations. AFAIK, there are no such compatibility problems in the GCC
implementation. (In GCC, they reason about the whole “poly_int”, instead of
“X”.)

If llvm.vscale(i32 ElementCount) is not the way we want to go, is there any
proposal to solve the compatibility problems in your type system?

Hanna Kruppe via llvm-dev

2020-Apr-16 14:15 UTC

[llvm-dev] Questions about vscale

On Mon, 13 Apr 2020 at 20:04, Renato Golin <rengolin at gmail.com> wrote:
>
> On Mon, 13 Apr 2020 at 18:04, Hanna Kruppe <hanna.kruppe at gmail.com> wrote:
> > I don't see how the situation you mention is comparable. Legalization
> > for e.g. <3 x i32> was not implemented at first, but as demonstrated
> > by the fact that it *was* implemented later, there's no conceptual
> > problem with legalizing that kind of type. You don't even have to
> > legalize them in vector registers, three scalar registers work fine
> > (you can even do that on the IR level).
>
> That was the point I was trying to make, but in my head that fused
> with register shadowing, which derailed the point.
>
> To be clear, yes, "invalid" register configurations can easily usually
> be legalised in multiple ways at lowering.
>
> Not all will be optimal, though, and there is where the problem lives.
>
> > Legalization (codegen in general) does not know if the machine
> > code will eventually run on a chip with vector registers so small that
> > vscale works out to 1/2, but it has to choose some legalization
> > strategy.
>
> This is interesting, I had not realised that from the descriptions of
> the problem so far. I thought it was just due to non-power-of-two
> lengths.
>
> A "vector" register that is smaller than 64 bits wouldn't make much
> sense, unless this is a DSP-type extension on very small types. In
> those cases, every clock cycle and every instruction counts,
> especially inside the inner loop.
>
> I'm struggling to see how this can be optimally executed from a
> generic scalable code, which usually profits from the fact that vscale
> >> 1.
I can understand this kind of concern, but the specification permits
it and this entire thread is predicated on needing to target those
cores too. If we'd decide we're okay with LLVM-based toolchains only
supporting hardware with e.g. VLEN >= 64 (but see [*] below) then
there's no problem to begin with and no need for ideas like fractional
vscale or types that can't be legalized. Alternatively, we could treat
support for vector registers smaller than 64b as a separate ABI, like
with soft-float ABI vs hard-float ABI.

However, the current aspiration among the ISA designers and
software/toolchain developers is different. Cores with tiny VRF are
expected to be useful for some markets, and it's hoped that the V
extension can "scale down" well enough to avoid the need for a second
vector extension specifically for those cores. Personally, I have some
doubts about how well this will work out in practice, but of course
software and toolchain developers (including myself) would prefer to
keep everything as portable as possible. Binary portability across
wildly different vector register sizes is an explicit goal of the ISA;
adding an exception to this for no good reason would be very
unfortunate.

[*] If we settled on requiring VLEN >= 64, we'd still face the same
problem again if we ever want to add support for vector elements
larger than 64b, such as quad-precision floats or 128 bit integers. I
don't really expect those to be commonly implemented for a long time,
but once again: it would be great to avoid the need for a separate and
incompatible target triple if and when such cores become relevant.
> > If <vscale x 1 x i32> ends up having one element, and <vscale x 2 x
> > i32> also has one (= 2 * 0.5) element, then that's wrong: the latter
> > type must have twice as many elements as the former (one example where
> > this matters: split_low / split_high / concat shuffle patterns). The
> > second option, a vector with *zero* elements, is just as wrong if not
> > worse.
>
> Right, that was the idea behind vscale from the beginning. I don't
> know how many elements either has, but I know the latter has twice as
> many as the former.
>
> I see why you would want half-length, because that truth still holds:
> the latter has twice as many halves as the former.
>
> But how do you handle the last half? Do you ignore? Do you load /
> store half? Do you always mask it out? Do you fuse with the next
> iterations' first half?
>
> If the semantics is not clear on how the back-ends are supposed to use
> that extra half, then extending the IR in such a way can make it very
> hard for generic optimisations to understand anything about the
> ranges, validity of operations, alignment, masks, undefined behaviour,
> etc.
>
> > It's not that a correct legalization exists but it's too annoying to
> > implement, or that one might exist but I'm too lazy to work it out.
>
> I never meant to imply that. Apologies if that's what came through.
Oh no, not at all! Sorry for the confusion, I should avoid rhetoric
that can create this impression.
> > We're also not running in a limitation or oddity of the RISC-V vector
> > ISA in particular. It's simply that, if you set vscale == 0.5, then by
> > the way scalable vector types work (vscale * const elements), some
> > vector types that can be written in the IR would need to have a
> > fractional number of elements to be consistent with the other scalable
> > vector types. As that is not possible (not even conceptually),
> > whatever code you emit to try to legalize that type will end up being
> > wrong in some respect.
>
> Honestly, I'm running out of breath in this discussion. :)
>
> I don't know a lot about SVE and even less about RISC-V, so I'll leave
> the more in-depth technical discussions for Florian/Sander and others
> to chime in.
Fair, thanks for the discussion so far :)

Best regards
Hanna
