Chandler Carruth via llvm-dev
2019-Mar-19 19:31 UTC
[llvm-dev] Scalable Vector Types in IR - Next Steps?
On Tue, Mar 19, 2019 at 4:11 AM Graham Hunter <Graham.Hunter at arm.com> wrote:

> Hi Eric and Chandler,
>
> I appreciate your concerns; I don't think the impact will be that
> great, but then it's rather easy for me to keep SVE in mind when
> working on other parts of the codebase, given how long I've spent
> working on it.
>
> Are there any additional constraints on the scalable types you think
> would alleviate your concerns a little? At the moment we will prevent
> scalable vectors from being included in structs and arrays, but we
> could add more (at least to start with) to avoid potential hidden
> problems.

While the constraints you mention are good, and important, I don't think there are more that matter.

> I'm also trying to come up with an idea of how much impact we have in
> our downstream implementation; most places where there is divergence
> are in the AArch64 backend (as you'd expect), followed by the generic
> SelectionDAG code -- but lowering and legalization for current
> instructions should (hopefully) be a one-off.
>
> Are there any specific parts of the codebase you'd like a report on,
> regarding the extent of the changes?

This is *not* about the changes required. It is about the long-term (think 10-year) complexity forced onto the IR. We now have vectors that are unlike *all other vectors* in the IR -- basically unlike all other types (a sketch of the proposed type appears at the end of this message). I believe we will be finding bugs with this special case ~forever. Will it be an untenable burden? Definitely not. We can manage. But the question is: does the benefit outweigh the cost? IMO, no.

I completely understand the benefit of this for the *ISA*, and I would encourage every ISA to adopt some vector instruction set with similar aspects. However, the more I talk with and work with my users doing SIMD programming (and my entire experience doing it personally), the more I believe this will be of extremely limited utility to model in the IR. There will be a small number of places where it can be used. All of those where performance matters will end up being tuned for *specific* widths anyway, to get the last few % of performance. Those that aren't performance critical won't see any substantial advantage over just being 128-bit vectorized or left scalar. At that point, we pay the complexity and maintenance cost of this completely special type in the IR for no material benefit.

I've said this several times in various discussions. My opinion has not changed, and no new information has been presented by others or by me, so I don't think it's productive to keep debating this technical point.

That said, it is entirely possible that I am wrong about the utility. If the consensus in the community is that we should move forward, I'm not going to block progress. It sounds like Hal, the Cray folks, and many ARM folks are all positive. So far, only Eric and I have said anything to the contrary. If there really isn't anyone else concerned about this, please just move forward; I think the cost of continuing to debate this is rapidly becoming unsustainable all on its own.
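For reference, a rough sketch of the type under discussion, assuming the <vscale x 4 x i32> spelling that LLVM eventually adopted (the RFC's concrete syntax went through several revisions); the snippet is illustrative, not taken from the proposal itself:

    ; A scalable vector holds vscale * 4 i32 elements, where vscale is a
    ; hardware-dependent positive integer unknown at compile time.
    %sum = add <vscale x 4 x i32> %a, %b

    ; The constraint mentioned above: because the size is unknown at
    ; compile time, scalable vectors may not be members of structs or
    ; arrays.
    ; %pair = type { i32, <vscale x 4 x i32> }   ; would be rejected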
Bruce Hoult via llvm-dev
2019-Mar-21 00:56 UTC
[llvm-dev] Scalable Vector Types in IR - Next Steps?
On Tue, Mar 19, 2019 at 12:32 PM Chandler Carruth via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> However, the more I talk with and work with my users doing SIMD
> programming (and my entire experience doing it personally), the more
> I believe this will be of extremely limited utility to model in the
> IR. There will be a small number of places where it can be used. All
> of those where performance matters will end up being tuned for
> *specific* widths anyway, to get the last few % of performance. Those
> that aren't performance critical won't see any substantial advantage
> over just being 128-bit vectorized or left scalar. At that point, we
> pay the complexity and maintenance cost of this completely special
> type in the IR for no material benefit.

To me, this is nothing like SIMD programming. I've done that, with VMX/Altivec and NEON. I've been working with a number of kernels implemented on RISC-V vectors recently. At least for the things we've been looking at so far, the code is almost exactly the same as you'd use to implement the same algorithm (possibly pipelined, unrolled, etc.) using 32 normal FP registers; it's just that you work on some unknown-at-compile-time number of different outer-loop iterations in parallel.

For example, maybe you've got a whole lot of 3x3 matrices to invert. You load each element of the first matrix into nine registers, then calculate the determinant, then permute the input values into their new positions while dividing them by the determinant, and write them all out. It's exactly the same with the vector ISA, except you might be loading and working on 1, 2, 4, ... 1000 of the matrices in parallel. You just don't know, and it doesn't matter.

The same goes for sgemm. You work on strips eight (say) wide/high. In one dimension you have normal loads/stores, and in the other dimension you have strided loads/stores. You're working on rectangular blocks 8 high/wide and some unknown-at-compile-time amount wide/high -- on some small machine it might be 1 (i.e. basically a standard FP register file, but the vector ISA works on it correctly), but presumably on most machines it will be something like 4 or 8 or 16 elements.

If you unroll either of these kernels once (or software-pipeline it) then you're going to pretty much saturate your memory system or your FMA units or both, depending on the particular kernel's ratio of compute to bytes, how many functional units you have, and the width of your memory bus. Maybe you're right that hand-tuned SIMD code with explicit knowledge of the vector length might get you single-digit-percentage better performance, but it probably won't be more than that, and it's a lot of work.

As for LLVM IR support... I don't have a firm opinion on whether this scalable type proposal is sufficient, insufficient, or overkill. My own gut feeling is that the existing type system is fine for describing vector data in memory, and that all we need (at least for RISC-V) is a new register file that is very similar to that of any machine with a unified int/fp register file. LLVM needs to manage register allocation in this register file just as it does for regular int or fp register files. Spills and reloads of these registers would be undesirable, but if they are needed then the compiler would have to allocate the space for them using alloca (or malloc). The biggest thing needed, I think, is understanding one unusual instruction: vsetvl{i}.
At the head of each loop you explicitly use the vsetvl{i} instruction to set the register width (the vector element width) to something between 8 bits and 1024 bits. The vsetvl instruction returns an integer, which you normally use only to scale by the element width you just set, using the result to bump your input and output pointers by N elements instead of 1. (A rough sketch of the resulting loop shape appears at the end of this message.)

So you kind of need a new type for the registers, but it's purely for the registers. Not only can you not include it in arrays or structs, you also can't load it from memory or store it to memory.

The plan for RISC-V is also that all 32 vector registers will be caller-save/volatile: if you call a function, then when it returns you have to assume that all vector registers have been trashed. There are no functions using the standard ABI that take vector registers as arguments or return vector registers as results. The only apparent exception is the compiler's runtime library, which will have things the compiler explicitly knows about, such as transcendental functions -- but they don't use the standard ABI.
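To make the loop shape concrete, a rough LLVM IR sketch of such a strip-mined loop. @vsetvl and @vload are purely hypothetical stand-ins -- no such intrinsics existed at the time; they merely mirror the vsetvl{i} pattern described above:

    loop:
      %n      = phi i64 [ %len, %entry ], [ %n.rem, %loop ]
      %p      = phi i32* [ %src, %entry ], [ %p.next, %loop ]
      ; Request up to %n elements; %vl is the count the hardware grants.
      %vl     = call i64 @vsetvl(i64 %n)
      %v      = call <vscale x 1 x i32> @vload(i32* %p, i64 %vl)
      ; ... compute on %v and store the results ...
      ; Bump the pointer by %vl elements, not by a fixed width.
      %p.next = getelementptr i32, i32* %p, i64 %vl
      %n.rem  = sub i64 %n, %vl
      %more   = icmp ne i64 %n.rem, 0
      br i1 %more, label %loop, label %exit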
Sebastian Pop via llvm-dev
2019-Mar-27 21:33 UTC
[llvm-dev] Scalable Vector Types in IR - Next Steps?
I am of the opinion that handling scalable vectors (SV) as builtins and an opaque SV type is a good option:

1. The implementation of SV with builtins is simpler than changing the IR.

2. Most of the transforms in opt are scalar opts; they do not optimize vector operations and will not deal with SV either.

3. With builtins there are fewer places to pay attention to, as most of the compiler already deals with builtins in a neutral way.

4. The builtin approach is more targeted and confined: it allows amending one optimizer at a time. In the alternative of changing the IR, one has to touch all the passes in the initial implementation. (A sketch contrasting the two representations appears at the end of this message.)

5. Optimizing code written with SV intrinsic calls can be done with about the same implementation effort in both cases (builtins and changing the IR). I do not believe that changing the IR to add SV types makes any optimizer work magically all of a sudden: no free lunch. In both cases we need to amend all the passes that remove inefficiencies in code written with SV intrinsic calls.

6. We will need a new SV auto-vectorizer pass that relies less on if-conversion, runtime disambiguation, and unrolling for the prolog/epilog, as the HW helps with all these cases and expands the number of loops that can be vectorized. Having native SV types or just plain builtins is equivalent here, as the code generator of the vectorizer can be improved to not generate inefficient code.

7. This is my point of view; I may be wrong, so don't let me slow you down in getting it done!

Sebastian
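To make the contrast concrete, here is the same vector add under the two alternatives; the opaque type %sv.i32 and the intrinsic names are purely illustrative, not part of any actual proposal:

    ; (a) Opaque type + builtins: the element type and count are encoded
    ;     in the intrinsic name, and generic IR instructions cannot
    ;     operate on the opaque type.
    %r1 = call %sv.i32 @llvm.sv.add.i32(%sv.i32 %a, %sv.i32 %b)

    ; (b) First-class scalable type: the ordinary instruction applies,
    ;     and the type system carries the element information.
    %r2 = add <vscale x 4 x i32> %a, %b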
Finkel, Hal J. via llvm-dev
2019-Mar-27 23:40 UTC
[llvm-dev] Scalable Vector Types in IR - Next Steps?
On 3/27/19 4:33 PM, Sebastian Pop via llvm-dev wrote:

> I am of the opinion that handling scalable vectors (SV) as builtins
> and an opaque SV type is a good option:
>
> 1. The implementation of SV with builtins is simpler than changing
> the IR.
>
> 2. Most of the transforms in opt are scalar opts; they do not
> optimize vector operations and will not deal with SV either.
>
> 3. With builtins there are fewer places to pay attention to, as most
> of the compiler already deals with builtins in a neutral way.
>
> 4. The builtin approach is more targeted and confined: it allows
> amending one optimizer at a time. In the alternative of changing the
> IR, one has to touch all the passes in the initial implementation.

Interestingly, with similar considerations, I've come to the opposite conclusion. While in theory the intrinsics and opaque types are more targeted and confined, this only remains true *if* we don't end up teaching a bunch of transformation and analysis passes about them. However, I feel it is inevitable that we will:

1. While we already have unsized types in the IR, SV will add more of them, and, opaque or otherwise, there will be some cost to making all of the relevant places in the optimizer not crash in their presence. This cost we end up paying either way.

2. We're going to end up wanting to optimize SV operations. If we have intrinsics, we can add code to match (a + b) - b => a, but the question is: can we reuse the code in InstCombine which does this (see the sketch after this list)? We can make the answer yes by adding sufficient abstraction, but the code restructuring seems much worse than just adjusting the type system. Otherwise, we can't reuse the existing code for these SV optimizations if we use the intrinsics, and we'll be stuck in the unfortunate situation of slowly rewriting a version of InstCombine just to operate on the SV intrinsics. Moreover, the code will be worse, because we need to effectively extract the type information from the intrinsic names. By changing the type system to support SV, it seems we can reuse nearly all of the relevant InstCombine code.

3. It's not just InstCombine (and InstSimplify, etc.); we might also need to teach other passes about the intrinsics and their types (GVN?). It's not clear that the problem will be well confined.
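As a rough sketch of that fold under both representations (the opaque type and intrinsic names are illustrative only):

    ; With a first-class scalable type, InstCombine's existing
    ; (a + b) - b => a pattern applies unchanged:
    %t = add <vscale x 4 x i32> %a, %b
    %r = sub <vscale x 4 x i32> %t, %b    ; simplifies to %a

    ; With builtins and an opaque type, the same fold must be
    ; re-implemented against the calls, recovering the type information
    ; from the intrinsic names:
    %t2 = call %sv.i32 @llvm.sv.add.i32(%sv.i32 %a, %sv.i32 %b)
    %r2 = call %sv.i32 @llvm.sv.sub.i32(%sv.i32 %t2, %sv.i32 %b)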
> 5. Optimizing code written with SV intrinsic calls can be done with
> about the same implementation effort in both cases (builtins and
> changing the IR). I do not believe that changing the IR to add SV
> types makes any optimizer work magically all of a sudden: no free
> lunch. In both cases we need to amend all the passes that remove
> inefficiencies in code written with SV intrinsic calls.
>
> 6. We will need a new SV auto-vectorizer pass that relies less on
> if-conversion, runtime disambiguation, and unrolling for the
> prolog/epilog,

It's not obvious to me that this is true. Can you elaborate? Even with SV, it seems like you still need if-conversion and pointer checking, and unrolling the prologue/epilogue loops is handled later anyway by the full/partial unrolling pass; I don't see any fundamental change there. What is true is that we need to change the way the vectorizer deals with horizontal operations (e.g., reductions) -- these all need to turn into intrinsics to be handled later. This seems like a positive change, however.

> as the HW helps with all these cases and expands the number of loops
> that can be vectorized. Having native SV types or just plain builtins
> is equivalent here, as the code generator of the vectorizer can be
> improved to not generate inefficient code.

This does not seem equivalent, because while the mapping between scalar operations and SV operations is straightforward with the adjusted type system, the mapping between scalar operations and the intrinsics will require extra infrastructure to implement. Not that this is necessarily difficult to build, but it needs to be updated whenever we otherwise change the IR, and thus adds additional maintenance cost for all of us.

> 7. This is my point of view; I may be wrong, so don't let me slow you
> down in getting it done!
>
> Sebastian

Thanks again,
Hal

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory