thr3ads.net - llvm dev - [llvm-dev] [cfe-dev] RFC: Matrix math support [Oct 2019]

If this information is useful, please help other people find it:
Share via:

Florian Hahn via llvm-dev

2019-Oct-28 16:07 UTC

[llvm-dev] RFC: Matrix math support

Hello,

This is a follow-up to Adam’s original RFC (RFC: First-class Matrix type, [1]
<http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html>)
<https://confluence.sd.apple.com/pages/createpage.action?spaceKey=DPT&title=1&linkCreation=true&fromPageId=1360610956>
and his follow-up ([RFC] Matrix support (take 2, [2
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128322.html>]).
The main sticking points of the last thread were centred around representing
padding, propagating shape information (dimensions) and the number of proposed
intrinsics [8]
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>.

We've incorporated the feedback into our design and would like to share our
updated proposal. We stripped it down to the minimum required to realize the
major benefits, like elimination of loads/stores to temporaries and generation
of tuned loops for matrix multiplies. We discuss further extensions and points
that came up during the earlier discussions after the new proposal.

TLDR: Our new proposal boils down to adding 4 matrix related intrinsics
(transpose, multiply, and strided load & store) with shape information and
an IR pass, that lowers the intrinsics to regular vector operations. The IR pass
can also be extended to break down operations into chunks suitable for
specialised matrix ISAs (like the matrix instructions included in ARM v8.6) and
to reducing spilling. We've designed the approach so it's flexible and
easy to extend.
Proposal

After the initial feedback, we evaluated using ‘flattened’ vectors to hold
matrix values, instead of adding a new matrix type, as suggested originally.
This was option #1 suggested in [3]
<http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html>.

With that in mind, we propose to add a set of experimental intrinsics for matrix
operations that require information about the shape/layout of the underlying
matrix. The suggested intrinsics take the shape information as additional
(constant) integer arguments and column major layout is assumed initially. With
the new proposal, there are very few intrinsics that actually care about the
memory layout and they can be easily extended to also support row-major layouts.

Initially, we would like to propose the following intrinsics:
<C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32
<N>)
Treat %in as containing a matrix with M rows and N columns and transpose it. 

<C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y, i32
<M>, i32 <N>, i32<K>)
Treat %X as matrix with M rows and K columns, %Y as matrix with K rows and N
columns and multiply them.

<C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32 <M>, i32
<N>)
Load a matrix with M rows and N columns, using a stride of %Stride between
columns. This allows for convenient loading of sub matrixes.

void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64 %Stride,
i32 <M>, i32 <N>)
Store a matrix with M rows and N columns, using a stride of %Stride between
columns. This allows for convenient storing of sub matrixes.

The floating point versions of the intrinsics also take fast-math flags, which
can be used to opt-in to FMA generation and/or constant folding opportunities
via NoInfs and NoNaNs. We plan to add them to the lowered instructions and rely
on InstCombine & Co for related optimisations.

The intrinsics will be lowered to regular LLVM vector operations in a IR
lowering pass. This means per default, we can lower the builtins on all targets.
Additionally, targets can implement their own lowering to a specialized ISA
extension. We implemented such a lowering for an internal matrix accelerator.

The example below shows how the matrix_transpose builtin will be lowered in the
IR lowering pass.

define void @transpose(<8 x double> * %A, <8 x double>* %B) { 
entry:
%a = load <8 x double>, <8 x double>* %A, align 16
%b = call <8 x double> @llvm.matrix.transpose(<8 x double> %a, i32
2, i32 4)
store <8 x double> %c, <8 x double>* %B, align 16
ret void
}
declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)

Will get lowered to something like 

define void @transpose(<8 x double> * %A, <8 x double>* %B) {
entry:
%0 = bitcast <8 x double>* %A to <2 x double>*
%1 = load <2 x double>, <2 x double>* %0, align 8
%2 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 2
%3 = bitcast double* %2 to <2 x double>*
%4 = load <2 x double>, <2 x double>* %3, align 8
%5 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 4
%6 = bitcast double* %5 to <2 x double>*
%7 = load <2 x double>, <2 x double>* %6, align 8
%8 = shufflevector <2 x double> %7, <2 x double> undef, <4 x
i32> <i32 0, i32 1, i32 undef, i32 undef>
%9 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 6
%10 = bitcast double* %9 to <2 x double>*
%11 = load <2 x double>, <2 x double>* %10, align 8
%12 = shufflevector <2 x double> %11, <2 x double> undef, <4 x
i32> <i32 0, i32 1, i32 undef, i32 undef>
%13 = shufflevector <2 x double> %1, <2 x double> %4, <4 x
i32> <i32 0, i32 2, i32 undef, i32 undef>
%14 = shufflevector <4 x double> %13, <4 x double> %8, <4 x
i32> <i32 0, i32 1, i32 4, i32 undef>
%15 = shufflevector <4 x double> %14, <4 x double> %12, <4 x
i32> <i32 0, i32 1, i32 2, i32 4>
%16 = shufflevector <2 x double> %1, <2 x double> %4, <4 x
i32> <i32 1, i32 3, i32 undef, i32 undef>
%17 = shufflevector <4 x double> %16, <4 x double> %8, <4 x
i32> <i32 0, i32 1, i32 5, i32 undef>
%18 = shufflevector <4 x double> %17, <4 x double> %12, <4 x
i32> <i32 0, i32 1, i32 2, i32 5>
%19 = bitcast <8 x double>* %C to <4 x double>*
store <4 x double> %15, <4 x double>* %19, align 8
%20 = getelementptr <8 x double>, <8 x double>* %C, i64 0, i64 4
%21 = bitcast double* %20 to <4 x double>*
store <4 x double> %18, <4 x double>* %21, align 8
ret void
}

Before we do the actual lowering, we propagate the shape information from
intrinsics to connected instructions. This allows us to improve the code we
generate for regular IR operations on matrixes embedded in a flattened vector.
In the example above, we propagate the information that we are loading a matrix
with 2 rows and 4 columns to `%a = load <8 x double>, <8 x double>*
%A, align 16` and lower it to a series of `load <2 x double>, <2 x
double>*`, which helps with avoiding a large number of shufflevector
instructions to get column vectors. Please note that propagating the shape
information allows us to improve the code we generate during lowering, but is
not required for correctness. Without propagating shape information, we would
just need additional shuffles to extract the rows/columns at the point where we
lower a matrix multiply for example.

This infrastructure can also be used to improve codegen for large vector
operations in general. For example, consider loading 2 <100 x double>
vectors, adding them and storing the result. Currently this will be lowered
naively doing most of the loads first, then most of the adds and then most of
the stores, causing plenty of spilling on AArch64. We can significantly improve
the generated code by breaking it into load/add/store blocks that fit into the
target registers.
We also plan to implement some additional optimisations as part of the lowering,
like generating fused/tiled loops for matrix multiplications and direct FMA
generation.

As future work, we are also evaluating adding additional operations including
clamping a vector or matrix, min/max of matrixes, inverting a matrix and
computing matrix determinates. Some of those might require additional
intrinsics, which we can discuss on a case by case basis.

In terms of integration, we are planning on adding a separate MatrixIRBuilder
utility (as suggested in [4]
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>),
which can be used by frontends to create various matrix operations. For point
wise operations, they will lower to vector instructions directly.
Potential Future Extensions

During the previous discussions, a few extensions were discussed, but we would
like exclude them from the initial implementation. Those are
N-Dimension
Given that the dimensions are represented as arguments to intrinsics, we can go
from supporting 2 dimensions to N dimensions by using variadic arguments for
shape information and handle it accordingly while lowering.
Padding
For small matrixes, it might be desirable to automatically add additional
padding to columns/rows, e.g. add 1 element padding to each column in a 3 x 3
matrix, to allow for using vector instructions operating on power-of-2 number of
elements or satisfy an alignment requirement by a target. This allows for
additional optimizations, but is not required for lowering the intrinsics. We
also haven't seen this being an issue so far. We should be able to iterate
on that, once it becomes an issue. (Earlier discussion [5]
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html> )
Support in Clang

We are also planning to expose the matrix support in Clang via a matrix type on
the C/C++ level (similar to the ext_vector_type attribute) together with a set
of matrix builtins. Similar to the LLVM part, we would like to propose a
pragmatic solution for common matrix operations, while being flexible enough to
allow iterative improvements to accommodate additional requirements. The main
motivation for the matrix support on the Clang side is to give users a way to
Guarantee generating high-quality code for matrix operations and trees of matrix
operations. For isolated operations, we can guarantee vector code generation
suitable for the target. For trees of operations, the proposed value type helps
with eliminating temporary loads & stores.
Make use of specialized matrix ISA extensions, like the new matrix instructions
in ARM v8.6 or various proprietary matrix accelerators, in their C/C++ code.
Move optimisations from matrix wrapper libraries into the compiler. We use it
internally to simplify an Eigen-style matrix library, by relying on LLVM for
generating tiled & fused loops for matrix operations.

We propose adding a new matrix value type, that can be declared via a
`matrix_type` attribute. Alternatively we could also generalise the existing
ext_vector_type attribute ([6]
<https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors>),
if that is preferred. Initially, the matrix type supports 2 dimensions (rows
& columns). For example, a 4x4 float matrix would be declared as `typedef
float m4x4_t __attribute__((matrix_type(4, 4)));`.

Initially we want to focus on 2D matrixes without padding in column-major layout
as a concrete use case. But our proposed type can be extended naturally to
Support N (known constant) dimensions by turning matrix_type attribute into a
variadic attribute.
Support column/row-wise padding, by adding a column_padding clause to the
attribute.
Dealing with the padding could be exclusively handled on the frontend side, by
emitting additional shufflevector instructions to extract the data. If there is
a desire to exploit the padding more on the LLVM side, we can add a set of
intrinsics for that.
Support row & column major layouts, by adding a layout clause to the
attribute.
Again, this naively could be handled while lowering to LLVM IR in Clang using
shufflevector to produce flattened vectors with the required layout. For better
optimisations, the LLVM intrinsics relying on shape/layout information can be
extended to take the layout as additional argument. Through propagating the
layout information similar to the dimensions, we should be able to optimise the
points where we need to transform the layout of the underlying matrixes.
In all cases, we require known constants as dimensions and we do not plan to
support dynamic dimensions for now.

In addition to the type, we propose adding a set of overloaded builtins to
perform operations on matrix values. To start with, we would like to add the
following builtins: __builtin_matrix_add, __builtin_matrix_subtract,
__builtin_matrix_multiply, __builtin_matrix_scalar_multiply,
__builtin_matrix_transpose, __builtin_matrix_insert, __builtin_matrix_extract,
__builtin_matrix_column_load, __builtin_matrix_column_store. Those builtins
allowed us to cover a wide range of uses cases internally, but as mentioned on
the LLVM side already, we are also evaluating adding additional operations, like
clamping matrixes or getting the minimum/maximum of a matrix. We should be able
to iterate on the set of operations, as required by actual users.

In terms of LLVM integration, we plan to provide a MatrixIRBuilder (see LLVM
part), that can lower the various builtins to LLVM IR (in terms of matrix
builtins or vector operations, depending on the builtin).



We currently have an internal implementation based on Adam’s original proposal
('first class matrix type') and are using that for a large internal
project in production. We also prototyped the approach outlined in this proposal
('flattened matrixes') to make sure we can produce equivalent code (and
thus performance) with both approaches on a large codebase.

We think predication is out of scope for the initial proposal and the right
forum to discuss predication support in general are the related discussions on
the list, the LLVM-VP proposal, [7] <https://reviews.llvm.org/D57504> and
the round table at the Dev Meeting.

We think our current proposal addresses the concerns raised previously,
especially the concerns around the high cost of adding a new IR type, adding too
many new intrinsics and generalising the approach to N dimensions. Unless there
are any additional major concerns, we should be able to share patches for review
soon.

Cheers,
Florian


[1] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html
<http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html>
[2] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128322.html
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128322.html>
[3] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html
<http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html>
[4] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>
[5] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html>
[6]
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
<https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors>
[7] https://reviews.llvm.org/D57504 <https://reviews.llvm.org/D57504>
[8] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191028/9c39b45b/attachment-0001.html>

Chris Lattner via llvm-dev

2019-Oct-29 05:09 UTC

head link

[llvm-dev] [cfe-dev] RFC: Matrix math support

Hi Florian, thanks for the update!
> On Oct 28, 2019, at 9:07 AM, Florian Hahn via cfe-dev <cfe-dev at
lists.llvm.org> wrote:
...> After the initial feedback, we evaluated using ‘flattened’ vectors to hold
matrix values, instead of adding a new matrix type, as suggested originally.
This was option #1 suggested in [3]
<http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html>.
Nice!
> 
> With that in mind, we propose to add a set of experimental intrinsics for
matrix operations that require information about the shape/layout of the
underlying matrix. The suggested intrinsics take the shape information as
additional (constant) integer arguments and column major layout is assumed
initially. With the new proposal, there are very few intrinsics that actually
care about the memory layout and they can be easily extended to also support
row-major layouts.
> 
> Initially, we would like to propose the following intrinsics:
> <C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32
<N>)
> Treat %in as containing a matrix with M rows and N columns and transpose
it.
> 
> <C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y, i32
<M>, i32 <N>, i32<K>)
> Treat %X as matrix with M rows and K columns, %Y as matrix with K rows and
N columns and multiply them.
> 
> <C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32 <M>,
i32 <N>)
> Load a matrix with M rows and N columns, using a stride of %Stride between
columns. This allows for convenient loading of sub matrixes.
> 
> void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64
%Stride, i32 <M>, i32 <N>)
> Store a matrix with M rows and N columns, using a stride of %Stride between
columns. This allows for convenient storing of sub matrixes.This all seems reasonable to me - a constrained extension that can start out as
experimental.  I personally don’t have any significant objections to this.
> The floating point versions of the intrinsics also take fast-math flags,
which can be used to opt-in to FMA generation and/or constant folding
opportunities via NoInfs and NoNaNs. We plan to add them to the lowered
instructions and rely on InstCombine & Co for related optimisations.
What is your though on the FP semantics of matmul?  There is a lot of room for
interpretation and various ‘fast’ approximations that are occasionally
interesting.  Should this use the existing fast math flags or are they
different?
> The intrinsics will be lowered to regular LLVM vector operations in a IR
lowering pass. This means per default, we can lower the builtins on all targets.
Nice!
> Before we do the actual lowering, we propagate the shape information from
intrinsics to connected instructions. This allows us to improve the code we
generate for regular IR operations on matrixes embedded in a flattened vector.
In the example above, we propagate the information that we are loading a matrix
with 2 rows and 4 columns to `%a = load <8 x double>, <8 x double>*
%A, align 16` and lower it to a series of `load <2 x double>, <2 x
double>*`, which helps with avoiding a large number of shufflevector
instructions to get column vectors. Please note that propagating the shape
information allows us to improve the code we generate during lowering, but is
not required for correctness. Without propagating shape information, we would
just need additional shuffles to extract the rows/columns at the point where we
lower a matrix multiply for example.
I’m very glad to hear that this only affects performance but not correctness. 
This ensures that the extension is correct in the face of outlining and various
code motion passes that can introduces weird phi nodes that may (in unusual
cases) escape the limits of your reasoning.
> As future work, we are also evaluating adding additional operations
including clamping a vector or matrix, min/max of matrixes, inverting a matrix
and computing matrix determinates. Some of those might require additional
intrinsics, which we can discuss on a case by case basis.
+1.  It would be interesting to extend the set of target independent vector
intrinsics in general.
> Potential Future Extensions
> 
> 
> Padding
> For small matrixes, it might be desirable to automatically add additional
padding to columns/rows, e.g. add 1 element padding to each column in a 3 x 3
matrix, to allow for using vector instructions operating on power-of-2 number of
elements or satisfy an alignment requirement by a target. This allows for
additional optimizations, but is not required for lowering the intrinsics. We
also haven't seen this being an issue so far. We should be able to iterate
on that, once it becomes an issue. (Earlier discussion [5]
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html> )Random thought, but would it be enough to model these as undef elements if they
became important?
> We propose adding a new matrix value type, that can be declared via a
`matrix_type` attribute. Alternatively we could also generalise the existing
ext_vector_type attribute ([6]
<https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors>),
if that is preferred.
I don’t care *strongly* about this at all, but I have a weak preference for
introducing a new attribute for this.  There is effectively no cost to doing so,
and this would make the documentation in the clang extensions manual much easier
to understand.
> In all cases, we require known constants as dimensions and we do not plan
to support dynamic dimensions for now.
I’m sure the clang folks will want to know what you mean by ‘constant’s in terms
of ICE, constexpr, integer template arguments, etc...
> We think our current proposal addresses the concerns raised previously,
especially the concerns around the high cost of adding a new IR type, adding too
many new intrinsics and generalising the approach to N dimensions. Unless there
are any additional major concerns, we should be able to share patches for review
soon.
+1, sounds like a very promising direction, thank you!

-Chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191028/ab4d1fc6/attachment.html>

Florian Hahn via llvm-dev

2019-Oct-29 21:20 UTC

head link

[llvm-dev] [cfe-dev] RFC: Matrix math support

Hi Chris,

Thanks for taking a look!
> On Oct 29, 2019, at 05:09, Chris Lattner <clattner at nondot.org>
wrote:
> 
> Hi Florian, thanks for the update!
> 
>> On Oct 28, 2019, at 9:07 AM, Florian Hahn via cfe-dev <cfe-dev at
lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
> ...
>> After the initial feedback, we evaluated using ‘flattened’ vectors to
hold matrix values, instead of adding a new matrix type, as suggested
originally. This was option #1 suggested in [3]
<http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html>.
> 
> Nice!
> 
>> 
>> With that in mind, we propose to add a set of experimental intrinsics
for matrix operations that require information about the shape/layout of the
underlying matrix. The suggested intrinsics take the shape information as
additional (constant) integer arguments and column major layout is assumed
initially. With the new proposal, there are very few intrinsics that actually
care about the memory layout and they can be easily extended to also support
row-major layouts.
>> 
>> Initially, we would like to propose the following intrinsics:
>> <C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32
<N>)
>> Treat %in as containing a matrix with M rows and N columns and
transpose it.
>> 
>> <C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y,
i32 <M>, i32 <N>, i32<K>)
>> Treat %X as matrix with M rows and K columns, %Y as matrix with K rows
and N columns and multiply them.
>> 
>> <C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32
<M>, i32 <N>)
>> Load a matrix with M rows and N columns, using a stride of %Stride
between columns. This allows for convenient loading of sub matrixes.
>> 
>> void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64
%Stride, i32 <M>, i32 <N>)
>> Store a matrix with M rows and N columns, using a stride of %Stride
between columns. This allows for convenient storing of sub matrixes.
> This all seems reasonable to me - a constrained extension that can start
out as experimental.  I personally don’t have any significant objections to
this.
> 
>> The floating point versions of the intrinsics also take fast-math
flags, which can be used to opt-in to FMA generation and/or constant folding
opportunities via NoInfs and NoNaNs. We plan to add them to the lowered
instructions and rely on InstCombine & Co for related optimisations.
> 
> What is your though on the FP semantics of matmul?  There is a lot of room
for interpretation and various ‘fast’ approximations that are occasionally
interesting.  Should this use the existing fast math flags or are they
different?
Initially we thought of using the existing fast math flags directly, apply them
to the generated instruction while lowering and rely on the existing related
optimizations in InstCombine & co. But it should be possible to handle
additional FP semantics directly while doing the lowering, if there are users
that could benefit.
> 
>> The intrinsics will be lowered to regular LLVM vector operations in a
IR lowering pass. This means per default, we can lower the builtins on all
targets.
> 
> Nice!
> 
>> Before we do the actual lowering, we propagate the shape information
from intrinsics to connected instructions. This allows us to improve the code we
generate for regular IR operations on matrixes embedded in a flattened vector.
In the example above, we propagate the information that we are loading a matrix
with 2 rows and 4 columns to `%a = load <8 x double>, <8 x double>*
%A, align 16` and lower it to a series of `load <2 x double>, <2 x
double>*`, which helps with avoiding a large number of shufflevector
instructions to get column vectors. Please note that propagating the shape
information allows us to improve the code we generate during lowering, but is
not required for correctness. Without propagating shape information, we would
just need additional shuffles to extract the rows/columns at the point where we
lower a matrix multiply for example.
> 
> I’m very glad to hear that this only affects performance but not
correctness.  This ensures that the extension is correct in the face of
outlining and various code motion passes that can introduces weird phi nodes
that may (in unusual cases) escape the limits of your reasoning.
> 
>> As future work, we are also evaluating adding additional operations
including clamping a vector or matrix, min/max of matrixes, inverting a matrix
and computing matrix determinates. Some of those might require additional
intrinsics, which we can discuss on a case by case basis.
> 
> +1.  It would be interesting to extend the set of target independent vector
intrinsics in general.
> 
>> Potential Future Extensions
>> 
>> 
>> Padding
>> For small matrixes, it might be desirable to automatically add
additional padding to columns/rows, e.g. add 1 element padding to each column in
a 3 x 3 matrix, to allow for using vector instructions operating on power-of-2
number of elements or satisfy an alignment requirement by a target. This allows
for additional optimizations, but is not required for lowering the intrinsics.
We also haven't seen this being an issue so far. We should be able to
iterate on that, once it becomes an issue. (Earlier discussion [5]
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html> )
> Random thought, but would it be enough to model these as undef elements if
they became important?
Interesting! Yes, I think we could use undef while lowering the builtins in
clang, to deal with padding.
> 
>> We propose adding a new matrix value type, that can be declared via a
`matrix_type` attribute. Alternatively we could also generalise the existing
ext_vector_type attribute ([6]
<https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors>),
if that is preferred.
> 
> I don’t care *strongly* about this at all, but I have a weak preference for
introducing a new attribute for this.  There is effectively no cost to doing so,
and this would make the documentation in the clang extensions manual much easier
to understand.
> 
>> In all cases, we require known constants as dimensions and we do not
plan to support dynamic dimensions for now.
> 
> I’m sure the clang folks will want to know what you mean by ‘constant’s in
terms of ICE, constexpr, integer template arguments, etc...
> 
>> We think our current proposal addresses the concerns raised previously,
especially the concerns around the high cost of adding a new IR type, adding too
many new intrinsics and generalising the approach to N dimensions. Unless there
are any additional major concerns, we should be able to share patches for review
soon.
> 
> +1, sounds like a very promising direction, thank you!
> 
Thanks for all the valuable input during the earlier discussions!

Cheers,
Florian

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191029/9eeb74ee/attachment-0001.html>

Florian Hahn via llvm-dev

2019-Nov-04 20:17 UTC

head link

[llvm-dev] RFC: Matrix math support

Thanks a lot for the comments so far!

I’m planning to start preparing a set of initial patches for the LLVM side in a
couple of days. Please let me know if there are any remaining high-level
concerns.

I received additional feedback off-list to try to remove the transpose and
load/store intrinsics initially. I'll probably exclude them from the initial
set of patches (we currently mostly have them for convenience) and add them as
we go along, if required.

Cheers,
Florian
> On Oct 28, 2019, at 17:07, Florian Hahn via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Hello,
> 
> This is a follow-up to Adam’s original RFC (RFC: First-class Matrix type,
[1]) and his follow-up ([RFC] Matrix support (take 2, [2]). The main sticking
points of the last thread were centred around representing padding, propagating
shape information (dimensions) and the number of proposed intrinsics [8].
> 
> We've incorporated the feedback into our design and would like to share
our updated proposal. We stripped it down to the minimum required to realize the
major benefits, like elimination of loads/stores to temporaries and generation
of tuned loops for matrix multiplies. We discuss further extensions and points
that came up during the earlier discussions after the new proposal.
> 
> TLDR: Our new proposal boils down to adding 4 matrix related intrinsics
(transpose, multiply, and strided load & store) with shape information and
an IR pass, that lowers the intrinsics to regular vector operations. The IR pass
can also be extended to break down operations into chunks suitable for
specialised matrix ISAs (like the matrix instructions included in ARM v8.6) and
to reducing spilling. We've designed the approach so it's flexible and
easy to extend.
> Proposal
> 
> After the initial feedback, we evaluated using ‘flattened’ vectors to hold
matrix values, instead of adding a new matrix type, as suggested originally.
This was option #1 suggested in [3].
> 
> With that in mind, we propose to add a set of experimental intrinsics for
matrix operations that require information about the shape/layout of the
underlying matrix. The suggested intrinsics take the shape information as
additional (constant) integer arguments and column major layout is assumed
initially. With the new proposal, there are very few intrinsics that actually
care about the memory layout and they can be easily extended to also support
row-major layouts.
> 
> Initially, we would like to propose the following intrinsics:
> 	• <C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32
<N>)
> Treat %in as containing a matrix with M rows and N columns and transpose
it.
> 
> 	• <C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y, i32
<M>, i32 <N>, i32<K>)
> Treat %X as matrix with M rows and K columns, %Y as matrix with K rows and
N columns and multiply them.
> 
> 	• <C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32
<M>, i32 <N>)
> Load a matrix with M rows and N columns, using a stride of %Stride between
columns. This allows for convenient loading of sub matrixes.
> 
> 	• void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64
%Stride, i32 <M>, i32 <N>)
> Store a matrix with M rows and N columns, using a stride of %Stride between
columns. This allows for convenient storing of sub matrixes.
> 
> The floating point versions of the intrinsics also take fast-math flags,
which can be used to opt-in to FMA generation and/or constant folding
opportunities via NoInfs and NoNaNs. We plan to add them to the lowered
instructions and rely on InstCombine & Co for related optimisations.
> 
> The intrinsics will be lowered to regular LLVM vector operations in a IR
lowering pass. This means per default, we can lower the builtins on all targets.
Additionally, targets can implement their own lowering to a specialized ISA
extension. We implemented such a lowering for an internal matrix accelerator.
> 
> The example below shows how the matrix_transpose builtin will be lowered in
the IR lowering pass.
> 
> define void @transpose(<8 x double> * %A, <8 x double>* %B) { 
> entry:
> %a = load <8 x double>, <8 x double>* %A, align 16
> %b = call <8 x double> @llvm.matrix.transpose(<8 x double> %a,
i32 2, i32 4)
> store <8 x double> %c, <8 x double>* %B, align 16
> ret void
> }
> declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32,
i32)
> 
> Will get lowered to something like 
> 
> define void @transpose(<8 x double> * %A, <8 x double>* %B) {
> entry:
> %0 = bitcast <8 x double>* %A to <2 x double>*
> %1 = load <2 x double>, <2 x double>* %0, align 8
> %2 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 2
> %3 = bitcast double* %2 to <2 x double>*
> %4 = load <2 x double>, <2 x double>* %3, align 8
> %5 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 4
> %6 = bitcast double* %5 to <2 x double>*
> %7 = load <2 x double>, <2 x double>* %6, align 8
> %8 = shufflevector <2 x double> %7, <2 x double> undef, <4 x
i32> <i32 0, i32 1, i32 undef, i32 undef>
> %9 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 6
> %10 = bitcast double* %9 to <2 x double>*
> %11 = load <2 x double>, <2 x double>* %10, align 8
> %12 = shufflevector <2 x double> %11, <2 x double> undef, <4
x i32> <i32 0, i32 1, i32 undef, i32 undef>
> %13 = shufflevector <2 x double> %1, <2 x double> %4, <4 x
i32> <i32 0, i32 2, i32 undef, i32 undef>
> %14 = shufflevector <4 x double> %13, <4 x double> %8, <4 x
i32> <i32 0, i32 1, i32 4, i32 undef>
> %15 = shufflevector <4 x double> %14, <4 x double> %12, <4 x
i32> <i32 0, i32 1, i32 2, i32 4>
> %16 = shufflevector <2 x double> %1, <2 x double> %4, <4 x
i32> <i32 1, i32 3, i32 undef, i32 undef>
> %17 = shufflevector <4 x double> %16, <4 x double> %8, <4 x
i32> <i32 0, i32 1, i32 5, i32 undef>
> %18 = shufflevector <4 x double> %17, <4 x double> %12, <4 x
i32> <i32 0, i32 1, i32 2, i32 5>
> %19 = bitcast <8 x double>* %C to <4 x double>*
> store <4 x double> %15, <4 x double>* %19, align 8
> %20 = getelementptr <8 x double>, <8 x double>* %C, i64 0, i64
4
> %21 = bitcast double* %20 to <4 x double>*
> store <4 x double> %18, <4 x double>* %21, align 8
> ret void
> }
> 
> Before we do the actual lowering, we propagate the shape information from
intrinsics to connected instructions. This allows us to improve the code we
generate for regular IR operations on matrixes embedded in a flattened vector.
In the example above, we propagate the information that we are loading a matrix
with 2 rows and 4 columns to `%a = load <8 x double>, <8 x double>*
%A, align 16` and lower it to a series of `load <2 x double>, <2 x
double>*`, which helps with avoiding a large number of shufflevector
instructions to get column vectors. Please note that propagating the shape
information allows us to improve the code we generate during lowering, but is
not required for correctness. Without propagating shape information, we would
just need additional shuffles to extract the rows/columns at the point where we
lower a matrix multiply for example.
> 
> This infrastructure can also be used to improve codegen for large vector
operations in general. For example, consider loading 2 <100 x double>
vectors, adding them and storing the result. Currently this will be lowered
naively doing most of the loads first, then most of the adds and then most of
the stores, causing plenty of spilling on AArch64. We can significantly improve
the generated code by breaking it into load/add/store blocks that fit into the
target registers.
> We also plan to implement some additional optimisations as part of the
lowering, like generating fused/tiled loops for matrix multiplications and
direct FMA generation.
> 
> As future work, we are also evaluating adding additional operations
including clamping a vector or matrix, min/max of matrixes, inverting a matrix
and computing matrix determinates. Some of those might require additional
intrinsics, which we can discuss on a case by case basis.
> 
> In terms of integration, we are planning on adding a separate
MatrixIRBuilder utility (as suggested in [4]), which can be used by frontends to
create various matrix operations. For point wise operations, they will lower to
vector instructions directly.
> Potential Future Extensions
> 
> During the previous discussions, a few extensions were discussed, but we
would like exclude them from the initial implementation. Those are
> 	• N-Dimension
> Given that the dimensions are represented as arguments to intrinsics, we
can go from supporting 2 dimensions to N dimensions by using variadic arguments
for shape information and handle it accordingly while lowering.
> 	• Padding
> For small matrixes, it might be desirable to automatically add additional
padding to columns/rows, e.g. add 1 element padding to each column in a 3 x 3
matrix, to allow for using vector instructions operating on power-of-2 number of
elements or satisfy an alignment requirement by a target. This allows for
additional optimizations, but is not required for lowering the intrinsics. We
also haven't seen this being an issue so far. We should be able to iterate
on that, once it becomes an issue. (Earlier discussion [5] )
> Support in Clang
> 
> We are also planning to expose the matrix support in Clang via a matrix
type on the C/C++ level (similar to the ext_vector_type attribute) together with
a set of matrix builtins. Similar to the LLVM part, we would like to propose a
pragmatic solution for common matrix operations, while being flexible enough to
allow iterative improvements to accommodate additional requirements. The main
motivation for the matrix support on the Clang side is to give users a way to
> 	• Guarantee generating high-quality code for matrix operations and trees
of matrix operations. For isolated operations, we can guarantee vector code
generation suitable for the target. For trees of operations, the proposed value
type helps with eliminating temporary loads & stores.
> 	• Make use of specialized matrix ISA extensions, like the new matrix
instructions in ARM v8.6 or various proprietary matrix accelerators, in their
C/C++ code.
> 	• Move optimisations from matrix wrapper libraries into the compiler. We
use it internally to simplify an Eigen-style matrix library, by relying on LLVM
for generating tiled & fused loops for matrix operations.
> 
> We propose adding a new matrix value type, that can be declared via a
`matrix_type` attribute. Alternatively we could also generalise the existing
ext_vector_type attribute ([6]), if that is preferred. Initially, the matrix
type supports 2 dimensions (rows & columns). For example, a 4x4 float matrix
would be declared as `typedef float m4x4_t __attribute__((matrix_type(4, 4)));`.
> 
> Initially we want to focus on 2D matrixes without padding in column-major
layout as a concrete use case. But our proposed type can be extended naturally
to
> 	• Support N (known constant) dimensions by turning matrix_type attribute
into a variadic attribute.
> 	• Support column/row-wise padding, by adding a column_padding clause to
the attribute.
> Dealing with the padding could be exclusively handled on the frontend side,
by emitting additional shufflevector instructions to extract the data. If there
is a desire to exploit the padding more on the LLVM side, we can add a set of
intrinsics for that.
> 	• Support row & column major layouts, by adding a layout clause to the
attribute.
> Again, this naively could be handled while lowering to LLVM IR in Clang
using shufflevector to produce flattened vectors with the required layout. For
better optimisations, the LLVM intrinsics relying on shape/layout information
can be extended to take the layout as additional argument. Through propagating
the layout information similar to the dimensions, we should be able to optimise
the points where we need to transform the layout of the underlying matrixes.
> In all cases, we require known constants as dimensions and we do not plan
to support dynamic dimensions for now.
> 
> In addition to the type, we propose adding a set of overloaded builtins to
perform operations on matrix values. To start with, we would like to add the
following builtins: __builtin_matrix_add, __builtin_matrix_subtract,
__builtin_matrix_multiply, __builtin_matrix_scalar_multiply,
__builtin_matrix_transpose, __builtin_matrix_insert, __builtin_matrix_extract,
__builtin_matrix_column_load, __builtin_matrix_column_store. Those builtins
allowed us to cover a wide range of uses cases internally, but as mentioned on
the LLVM side already, we are also evaluating adding additional operations, like
clamping matrixes or getting the minimum/maximum of a matrix. We should be able
to iterate on the set of operations, as required by actual users.
> 
> In terms of LLVM integration, we plan to provide a MatrixIRBuilder (see
LLVM part), that can lower the various builtins to LLVM IR (in terms of matrix
builtins or vector operations, depending on the builtin).
> 
> 
> 
> We currently have an internal implementation based on Adam’s original
proposal ('first class matrix type') and are using that for a large
internal project in production. We also prototyped the approach outlined in this
proposal ('flattened matrixes') to make sure we can produce equivalent
code (and thus performance) with both approaches on a large codebase.
> 
> We think predication is out of scope for the initial proposal and the right
forum to discuss predication support in general are the related discussions on
the list, the LLVM-VP proposal, [7] and the round table at the Dev Meeting.
> 
> We think our current proposal addresses the concerns raised previously,
especially the concerns around the high cost of adding a new IR type, adding too
many new intrinsics and generalising the approach to N dimensions. Unless there
are any additional major concerns, we should be able to share patches for review
soon.
> 
> Cheers,
> Florian
> 
> 
> [1] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html
> [2] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128322.html
> [3] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html
> [4] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html
> [5] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html
> [6]
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
> [7] https://reviews.llvm.org/D57504
> [8] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Florian Hahn via llvm-dev

2019-Nov-19 20:29 UTC

head link

[llvm-dev] RFC: Matrix math support

Hi,

I just uploaded the initial patch adding the matrix intrinsics as described, as
well as an initial lowering pass: https://reviews.llvm.org/D70456

The first version of the patch works without any shape propagation and lowers
each intrinsic independently. We plan to iterate on the pass once it is in, to
improve the generated code as discussed in the RFC.

In a bit, I’ll also share a patch for the clang side, to get a concrete
discussion started about the Clang side.

Thanks,
Florian

Richard Smith via llvm-dev

2019-Nov-19 21:24 UTC

head link

[llvm-dev] [cfe-dev] RFC: Matrix math support

On Mon, 28 Oct 2019 at 09:08, Florian Hahn via cfe-dev <
cfe-dev at lists.llvm.org> wrote:
> Hello,This is a follow-up to Adam’s original RFC (RFC: First-class Matrix
> type, [1]
> <http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html>)
>
<https://confluence.sd.apple.com/pages/createpage.action?spaceKey=DPT&title=1&linkCreation=true&fromPageId=1360610956>
and
> his follow-up ([RFC] Matrix support (take 2, [2
>
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128322.html>]).
> The main sticking points of the last thread were centred around
> representing padding, propagating shape information (dimensions) and the
> number of proposed intrinsics [8]
>
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>.We've
> incorporated the feedback into our design and would like to share our
> updated proposal. We stripped it down to the minimum required to realize
> the major benefits, like elimination of loads/stores to temporaries and
> generation of tuned loops for matrix multiplies. We discuss further
> extensions and points that came up during the earlier discussions after the
> new proposal.
>
> TLDR: Our new proposal boils down to adding 4 matrix related intrinsics
> (transpose, multiply, and strided load & store) with shape information
and
> an IR pass, that lowers the intrinsics to regular vector operations. The IR
> pass can also be extended to break down operations into chunks suitable for
> specialised matrix ISAs (like the matrix instructions included in ARM v8.6)
> and to reducing spilling. We've designed the approach so it's
flexible and
> easy to extend.
> ProposalAfter the initial feedback, we evaluated using ‘flattened’
> vectors to hold matrix values, instead of adding a new matrix type, as
> suggested originally. This was option #1 suggested in [3]
> <http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html>.
>
> With that in mind, we propose to add a set of experimental intrinsics for
> matrix operations that require information about the shape/layout of the
> underlying matrix. The suggested intrinsics take the shape information as
> additional (constant) integer arguments and column major layout is assumed
> initially. With the new proposal, there are very few intrinsics that
> actually care about the memory layout and they can be easily extended to
> also support row-major layouts.
>
> Initially, we would like to propose the following intrinsics:
>
>    - <C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32
<N>)
>    Treat %in as containing a matrix with M rows and N columns and
>    transpose it.
>
>    - <C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y,
i32 <M>, i32 <N>,
>    i32<K>)
>    Treat %X as matrix with M rows and K columns, %Y as matrix with K rows
>    and N columns and multiply them.
>
>    - <C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32
<M>, i32
>    <N>)
>    Load a matrix with M rows and N columns, using a stride of %Stride
>    between columns. This allows for convenient loading of sub matrixes.
>
>    - void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64
>    %Stride, i32 <M>, i32 <N>)
>    Store a matrix with M rows and N columns, using a stride of %Stride
>    between columns. This allows for convenient storing of sub matrixes.
>
>
> The floating point versions of the intrinsics also take fast-math flags,
> which can be used to opt-in to FMA generation and/or constant folding
> opportunities via NoInfs and NoNaNs. We plan to add them to the lowered
> instructions and rely on InstCombine & Co for related optimisations.
>
> The intrinsics will be lowered to regular LLVM vector operations in a IR
> lowering pass. This means per default, we can lower the builtins on all
> targets. Additionally, targets can implement their own lowering to a
> specialized ISA extension. We implemented such a lowering for an internal
> matrix accelerator.
>
> The example below shows how the matrix_transpose builtin will be lowered
> in the IR lowering pass.
>
> define void @transpose(<8 x double> * %A, <8 x double>* %B) {
> entry:
> %a = load <8 x double>, <8 x double>* %A, align 16
> %b = call <8 x double> @llvm.matrix.transpose(<8 x double> %a,
i32 2, i32
> 4)
> store <8 x double> %c, <8 x double>* %B, align 16
> ret void
> }
> declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32,
i32)
>
> Will get lowered to something like
>
> define void @transpose(<8 x double> * %A, <8 x double>* %B) {
> entry:
> %0 = bitcast <8 x double>* %A to <2 x double>*
> %1 = load <2 x double>, <2 x double>* %0, align 8
> %2 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 2
> %3 = bitcast double* %2 to <2 x double>*
> %4 = load <2 x double>, <2 x double>* %3, align 8
> %5 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 4
> %6 = bitcast double* %5 to <2 x double>*
> %7 = load <2 x double>, <2 x double>* %6, align 8
> %8 = shufflevector <2 x double> %7, <2 x double> undef, <4 x
i32> <i32 0,
> i32 1, i32 undef, i32 undef>
> %9 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 6
> %10 = bitcast double* %9 to <2 x double>*
> %11 = load <2 x double>, <2 x double>* %10, align 8
> %12 = shufflevector <2 x double> %11, <2 x double> undef, <4
x i32> <i32
> 0, i32 1, i32 undef, i32 undef>
> %13 = shufflevector <2 x double> %1, <2 x double> %4, <4 x
i32> <i32 0,
> i32 2, i32 undef, i32 undef>
> %14 = shufflevector <4 x double> %13, <4 x double> %8, <4 x
i32> <i32 0,
> i32 1, i32 4, i32 undef>
> %15 = shufflevector <4 x double> %14, <4 x double> %12, <4 x
i32> <i32 0,
> i32 1, i32 2, i32 4>
> %16 = shufflevector <2 x double> %1, <2 x double> %4, <4 x
i32> <i32 1,
> i32 3, i32 undef, i32 undef>
> %17 = shufflevector <4 x double> %16, <4 x double> %8, <4 x
i32> <i32 0,
> i32 1, i32 5, i32 undef>
> %18 = shufflevector <4 x double> %17, <4 x double> %12, <4 x
i32> <i32 0,
> i32 1, i32 2, i32 5>
> %19 = bitcast <8 x double>* %C to <4 x double>*
> store <4 x double> %15, <4 x double>* %19, align 8
> %20 = getelementptr <8 x double>, <8 x double>* %C, i64 0, i64
4
> %21 = bitcast double* %20 to <4 x double>*
> store <4 x double> %18, <4 x double>* %21, align 8
> ret void
> }
>
> Before we do the actual lowering, we propagate the shape information from
> intrinsics to connected instructions. This allows us to improve the code we
> generate for regular IR operations on matrixes embedded in a flattened
> vector. In the example above, we propagate the information that we are
> loading a matrix with 2 rows and 4 columns to `%a = load <8 x
double>, <8 x
> double>* %A, align 16` and lower it to a series of `load <2 x
double>, <2 x
> double>*`, which helps with avoiding a large number of shufflevector
> instructions to get column vectors. Please note that propagating the shape
> information allows us to improve the code we generate during lowering, but
> is not required for correctness. Without propagating shape information, we
> would just need additional shuffles to extract the rows/columns at the
> point where we lower a matrix multiply for example.
>
> This infrastructure can also be used to improve codegen for large vector
> operations in general. For example, consider loading 2 <100 x double>
> vectors, adding them and storing the result. Currently this will be lowered
> naively doing most of the loads first, then most of the adds and then most
> of the stores, causing plenty of spilling on AArch64. We can significantly
> improve the generated code by breaking it into load/add/store blocks that
> fit into the target registers.
> We also plan to implement some additional optimisations as part of the
> lowering, like generating fused/tiled loops for matrix multiplications and
> direct FMA generation.
>
> As future work, we are also evaluating adding additional operations
> including clamping a vector or matrix, min/max of matrixes, inverting a
> matrix and computing matrix determinates. Some of those might require
> additional intrinsics, which we can discuss on a case by case basis.
>
> In terms of integration, we are planning on adding a separate
> MatrixIRBuilder utility (as suggested in [4]
>
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>),
> which can be used by frontends to create various matrix operations. For
> point wise operations, they will lower to vector instructions directly.
> Potential Future ExtensionsDuring the previous discussions, a few
> extensions were discussed, but we would like exclude them from the initial
> implementation. Those are
>
>    - N-Dimension
>    Given that the dimensions are represented as arguments to intrinsics,
>    we can go from supporting 2 dimensions to N dimensions by using variadic
>    arguments for shape information and handle it accordingly while
lowering.
>    - Padding
>    For small matrixes, it might be desirable to automatically add
>    additional padding to columns/rows, e.g. add 1 element padding to each
>    column in a 3 x 3 matrix, to allow for using vector instructions
operating
>    on power-of-2 number of elements or satisfy an alignment requirement by
a
>    target. This allows for additional optimizations, but is not required
for
>    lowering the intrinsics. We also haven't seen this being an issue so
far.
>    We should be able to iterate on that, once it becomes an issue. (Earlier
>    discussion [5]
>   
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html> )
>
> Support in ClangWe are also planning to expose the matrix support in
> Clang via a matrix type on the C/C++ level (similar to the
> ext_vector_type attribute) together with a set of matrix builtins.
> Similar to the LLVM part, we would like to propose a pragmatic solution for
> common matrix operations, while being flexible enough to allow iterative
> improvements to accommodate additional requirements. The main motivation
> for the matrix support on the Clang side is to give users a way to
>
>    - Guarantee generating high-quality code for matrix operations and
>    trees of matrix operations. For isolated operations, we can guarantee
>    vector code generation suitable for the target. For trees of operations,
>    the proposed value type helps with eliminating temporary loads &
stores.
>    - Make use of specialized matrix ISA extensions, like the new matrix
>    instructions in ARM v8.6 or various proprietary matrix accelerators, in
>    their C/C++ code.
>    - Move optimisations from matrix wrapper libraries into the compiler.
>    We use it internally to simplify an Eigen-style matrix library, by
relying
>    on LLVM for generating tiled & fused loops for matrix operations.
>
>
> We propose adding a new matrix value type, that can be declared via a
> `matrix_type` attribute. Alternatively we could also generalise the
> existing ext_vector_type attribute ([6]
>
<https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors>),
> if that is preferred. Initially, the matrix type supports 2 dimensions
> (rows & columns). For example, a 4x4 float matrix would be declared as
> `typedef float m4x4_t __attribute__((matrix_type(4, 4)));`.
>
Hi Florian,

You can find Clang's policy for accepting language extensions described at
http://clang.llvm.org/get_involved.html

This extension clearly satisfies point 2, and I'm happy to trust that we
can iterate to satisfying points 6 and 7. I would like to hear your
thoughts on the other criteria.

Initially we want to focus on 2D matrixes without padding in
column-major> layout as a concrete use case. But our proposed type can be extended
> naturally to
>
>    - Support N (known constant) dimensions by turning matrix_type
>    attribute into a variadic attribute.
>    - Support column/row-wise padding, by adding a column_padding clause
>    to the attribute.
>    Dealing with the padding could be exclusively handled on the frontend
>    side, by emitting additional shufflevector instructions to extract the
>    data. If there is a desire to exploit the padding more on the LLVM side,
we
>    can add a set of intrinsics for that.
>    - Support row & column major layouts, by adding a layout clause to
the
>    attribute.
>    Again, this naively could be handled while lowering to LLVM IR in
>    Clang using shufflevector to produce flattened vectors with the required
>    layout. For better optimisations, the LLVM intrinsics relying on
>    shape/layout information can be extended to take the layout as
additional
>    argument. Through propagating the layout information similar to the
>    dimensions, we should be able to optimise the points where we need to
>    transform the layout of the underlying matrixes.
>
> In all cases, we require known constants as dimensions and we do not plan
> to support dynamic dimensions for now.
>
> In addition to the type, we propose adding a set of overloaded builtins to
> perform operations on matrix values. To start with, we would like to add
> the following builtins: __builtin_matrix_add, __builtin_matrix_subtract,
> __builtin_matrix_multiply, __builtin_matrix_scalar_multiply,
> __builtin_matrix_transpose, __builtin_matrix_insert,
> __builtin_matrix_extract, __builtin_matrix_column_load,
> __builtin_matrix_column_store. Those builtins allowed us to cover a wide
> range of uses cases internally, but as mentioned on the LLVM side already,
> we are also evaluating adding additional operations, like clamping matrixes
> or getting the minimum/maximum of a matrix. We should be able to iterate on
> the set of operations, as required by actual users.
>
> In terms of LLVM integration, we plan to provide a MatrixIRBuilder (see
> LLVM part), that can lower the various builtins to LLVM IR (in terms of
> matrix builtins or vector operations, depending on the builtin).
>
>
>
> We currently have an internal implementation based on Adam’s original
> proposal ('first class matrix type') and are using that for a large
> internal project in production. We also prototyped the approach outlined in
> this proposal ('flattened matrixes') to make sure we can produce
equivalent
> code (and thus performance) with both approaches on a large codebase.
>
> We think predication is out of scope for the initial proposal and the
> right forum to discuss predication support in general are the related
> discussions on the list, the LLVM-VP proposal, [7]
> <https://reviews.llvm.org/D57504> and the round table at the Dev
Meeting.
>
> We think our current proposal addresses the concerns raised previously,
> especially the concerns around the high cost of adding a new IR type,
> adding too many new intrinsics and generalising the approach to N
> dimensions. Unless there are any additional major concerns, we should be
> able to share patches for review soon.
>
> Cheers,
> Florian
>
>
> [1] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html
> [2] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128322.html
> [3] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html
> [4] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html
> [5] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html
> [6]
>
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
> [7] https://reviews.llvm.org/D57504
> [8] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191119/ef188824/attachment.html>

Florian Hahn via llvm-dev

2019-Nov-22 14:18 UTC

head link

[llvm-dev] [cfe-dev] RFC: Matrix math support

> On Nov 19, 2019, at 21:24, Richard Smith <richard at metafoo.co.uk>
wrote:
> 
> Support in Clang
> 
> We are also planning to expose the matrix support in Clang via a matrix
type on the C/C++ level (similar to the ext_vector_type attribute) together with
a set of matrix builtins. Similar to the LLVM part, we would like to propose a
pragmatic solution for common matrix operations, while being flexible enough to
allow iterative improvements to accommodate additional requirements. The main
motivation for the matrix support on the Clang side is to give users a way to
> Guarantee generating high-quality code for matrix operations and trees of
matrix operations. For isolated operations, we can guarantee vector code
generation suitable for the target. For trees of operations, the proposed value
type helps with eliminating temporary loads & stores.
> Make use of specialized matrix ISA extensions, like the new matrix
instructions in ARM v8.6 or various proprietary matrix accelerators, in their
C/C++ code.
> Move optimisations from matrix wrapper libraries into the compiler. We use
it internally to simplify an Eigen-style matrix library, by relying on LLVM for
generating tiled & fused loops for matrix operations.
> 
> We propose adding a new matrix value type, that can be declared via a
`matrix_type` attribute. Alternatively we could also generalise the existing
ext_vector_type attribute ([6]
<https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors>),
if that is preferred. Initially, the matrix type supports 2 dimensions (rows
& columns). For example, a 4x4 float matrix would be declared as `typedef
float m4x4_t __attribute__((matrix_type(4, 4)));`.
> 
> Hi Florian,
> 
> You can find Clang's policy for accepting language extensions described
at http://clang.llvm.org/get_involved.html
<http://clang.llvm.org/get_involved.html>
> 
> This extension clearly satisfies point 2, and I'm happy to trust that
we can iterate to satisfying points 6 and 7. I would like to hear your thoughts
on the other criteria.

Hi Richard,

Thank you very much for taking a look! We are working on a clang-focused update
to the proposal to get a discussion rolling on the Clang integration. Thanks for
pointing me to the policy for language extensions, we will make sure to cover
the points mentioned there.

With Thanksgiving coming up in the US, it will probably take us until after the
holidays to share the update.

Cheers,
Florian
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191122/2f3d7dae/attachment.html>

llvm dev - Oct 2019 - [cfe-dev] RFC: Matrix math support

[llvm-dev] RFC: Matrix math support

[llvm-dev] [cfe-dev] RFC: Matrix math support

[llvm-dev] [cfe-dev] RFC: Matrix math support

[llvm-dev] RFC: Matrix math support

[llvm-dev] RFC: Matrix math support

[llvm-dev] [cfe-dev] RFC: Matrix math support

[llvm-dev] [cfe-dev] RFC: Matrix math support