thr3ads.net - llvm dev - [llvm-dev] [RFC] Matrix support (take 2) [Dec 2018]

If this information is useful, please help other people find it:
Share via:

Adam Nemet via llvm-dev

2018-Dec-05 18:41 UTC

[llvm-dev] [RFC] Matrix support (take 2)

Hi all,

After the previous RFC[1], there were multiple discussions on the ML and in
person at the DevMtg.  I will summarize the options discussed and propose a path
forward.

==========================Options
==========================
A. Extend VectorType to be multidimensional

B. Flatten matrices into the current VectorType.  Matrix shape and layout
information is passed to matrix intrinsics.  All matrix operations including
element-wise matrix operations are implemented via intrinsics

C. Same as B but padding is explicitly managed by shufflevector instructions,
element-wise operations are implemented via built-in operators (e.g. fadd)

==========================tl;dr
==========================
There was some support for option A to introduce first-class matrices (or
multidimensional vectors) but also many concerns.  I have sketched out many
examples in IR and flattening matrices to vectors does not seem to present any
clear show-stoppers.  Thus, I am scaling back the proposal to option B which is
a more incremental step.

Once option B is completed, we can reevaluate.  If there is evidence that
first-class type support is necessary we can change course (option A). 
Otherwise, we can use this stage as a performance baseline and start
incrementally replacing the intrinsics with a combination of shufflevectors and
built-in operators in order to experiment with option C.

Throughout this work, an important goal is to provide a matrix-aware IRBuilder
API.  E.g.:

  Value *CreateMatrixAdd(Value *Op0, Value Op1,
                         unsigned Rows, unsigned Cols,
                         MatrixLayout ML /* row/column-major, padding */);

This will allow for simpler front-ends and would allow us to swap out the
generated IR if the design needs to change.

==========================Details 
==========================
-------------------------
Introduction
-------------------------

Representing multi-dimensional vectors in the IR through types makes the IR more
expressive (option A).  Additionally, if we have a new type we have the freedom
to implicitly map it to a layout.  E.g. <3 x 3 x float> could imply
column-major order and one element of padding between each column.  When it’s
passed or returned from functions it should be passed in 3 vector registers.

This is a sample IR to add two 3x3 matrices followed by a matrix multiply with a
3x2:

  %a = load <3 x 3 x float>, <3 x 3 x float>* %A
  %b = load <3 x 3 x float>, <3 x 3 x float>* %B
  %c = load <3 x 2 x float>, <3 x 2 x float>* %C

  %add = fadd <3 x 3 x float> %a, %b
  %mul = call <3 x 2 x float> @llvm.matrix.multiply(<3 x 3 x float>
%add,
                                                    <3 x 2 x float> %c)
  store	<3 x 2 x float>	%mul, <3 x 2 x float>* %MUL

Note that the type always implies a layout here.  If we have multiple layouts
appear in the same module beyond the default specified by DataLayout, we would
have these represented in the type, e.g. <3 x 3 x float column-major
pad(1)> specifying column-major layout with a single element of padding after
each 3-element column vector.

Also note that we’re using built-in fadd operator for element-wise operation and
an intrinsic for non-element-wise operations like matrix multiply.

Instead of extending the type system, we can map the matrix instances onto
existing types.  The vector type is a natural fit as it can be considered a
SequenceType with Row*Column elements of the element type.  For operations, like
matrix multiply we just need to pass the shape information for the extra
dimension.

One question arises though, how padding should be handled.  For one, performing
operations like division with the padding can cause spurious faults.  But even
for non-trapping operations excluding padding should be an option.  For example
in the case of a <3 x 3 x double>, we may want to lower a single
row/column into the combination of a 128B vector register (2 elts) and a scalar
rather than two vectors.  This should be more beneficial for power.  Thus we
want to make padding explicit in the IR.

One option is to expose the shape to all operations including element-wise
operations.  This is option B.  With that, the above sequence looks like this:

  %a = load <12 x float>, <12 x float>* %A, align 16
  %b = load <12 x float>, <12 x float>* %B, align 16
  %c = load <8 x float>, <8 x float>* %C, align 16

  %add = call <12 x float> @llvm.matrix.fadd(<12 x float> %a, <12
x float> %b,
      	      	      	      	      	    ;    3 x 3  column-major:
                                             i32 3, i32 3,    i1 true)

  %mul = call <8 x float> @llvm.matrix.multiply(<12 x float> %add,
<8 x float> %c,
					       ;    3 x 3             3 x 2  column-major:
                                                i32 3, i32 3,     i32 3, i32 2, 
i1 true)
  store <8 x float> %mul, <8 x float>* %MUL, align 16
 
Each computation takes full shape information.  The matrix shape is described
with the row and columns dimensions and are passed to the intrinsic as constant
parameters.  We can also pass layout information like whether the matrices are
laid out in row-major or column-major order.  We’re using column major order in
the example and as such the 3 x 3 x float matrix is flattened into a 12 x float
vector with one element padding at the end of each column.

The amount of padding does not require any new parameters.  We can compute it
using the shape information and the size of the flattened matrix (e.g. %c which
is a <3 x 2 x float> also has one element of padding: Elts / Columns -
Rows = 8 / 2 - 3 = 1).

In order to expose the padding elements to element-wise operation (fadd), option
B maps those to intrinsics.  We can expose the padding bytes in other ways such
that we can still use the built-in element-wise operators.  One way would be to
extend the vector types with specifying the padding, something like <12 x
float pad(3, 7, 11)> (John McCall’s idea) or removing the padding with
explicit shufflevectors (Chandler’s idea).  I explored the lattter under option
C.  With option C, the same sequence of operations look like this:

  %a.padded = load <12 x float>, <12 x float>* %A, align 16
  ; remove padding
  %a = shufflevector <12 x float> %a.padded, <12 x float> undef,
                     <9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32
6, i32 8, i32 9, i32 10>

  %b.padded = load <12 x float>, <12 x float>* %B, align 16
  %b = shufflevector <12 x float> %b.padded, <12 x float> undef,
                     <9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32
6, i32 8, i32 9, i32 10>

  %c.padded = load <8 x float>, <8 x float>* %C, align 16
  %c = shufflevector <8 x float> %c.padded, <8 x float> undef,
                     <6 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32
6>


  %add = fadd <9 x float> %a, %b
  %mul = call <6 x float> @llvm.matrix.multiply(<9 x float> %add,
<6 x float> %c,
                                               ;    3 x 3             3 x 2    
column-major:
                                                i32 3, i32 3,     i32 3, i32 2, 
i1 true)

  ; add back padding
  %mul.padded = shufflevector <6 x float> %mul, <6 x float> undef,
                              <8 x i32> <i32 0, i32 1, i32 2, i32
undef, i32 3, i32 4, i32 5, i32 undef>
  store	<8 x float> %mul.padded, <8 x float>* %MUL, align 16

The expectation is that these shufflevectors will be removed as we lower to
HW-sized vectors, e.g. using three 4-wide vector adds.

------------------------- 
Evaluation
------------------------- 

While we have a prototype for option A which seems to work fine, I wanted to
evaluate the other options and see how they measure up.  I chose these criteria:
Intrusiveness of the change
Matrix Operation Lowering and Fusion
Inlining to enable more fusion
Matrix HW, matrix library
Proper padding for small matrices
Pass and return small matrices in vector registers
------------------------- 
Intrusiveness of the change
------------------------- 

Admittedly, option A is the most intrusive.  All the places where VectorType is
handled would have to be audited and potentially extended to multiple
dimensions.  Operations like extract/insertelement would have to be modified to
take an index list.

On the CodeGen side, legalization support for VectorTypes would have to be
extended to multiple dimensions as well.

Option B and C are less intrusive.  They are also fairly similar.  One
difference is that Option B replicates some of the built-in operators in
intrinsics which may need dedicated support in some optimizations.

------------------------- 
Matrix Operation Lowering and Fusion
------------------------- 

Common to all of these options is that we are proposing a new IR pass that
pre-legalizes matrix operations by lowering them to operations that are natively
supported by the HW.  This means decomposing the operations into native SIMD
operations.

This pass will be used to also de-interleave a chain of matrix operations to
manage register pressure.  The idea is to analyze the chain and then work only
on subsections of the matrices at a time to avoid the need of keeping everything
in registers.

For simple lowering (not fusion), let’s consider a 3x2 addition followed by a
matrix-multiply with a 2x4 for option A:

  %a = load <3 x 2 x float>, <3 x 2 x float>* %A
  %b = load <3 x 2 x float>, <3 x 2 x float>* %B
  %c = load <2 x 4 x float>, <2 x 4 x float>* %C

  %add = fadd <3 x 2 x float> %a, %b
  %mul = call <3 x 3 x float> @llvm.matrix.multiply(<3 x 2 x float>
%add,
                                                    <2 x 4 x float> %c)

We can lower this by replacing the matrix instructions with a set of
instructions computing each HW-vector-sized chunk of each column (assuming
column-major order and padded columns to a 16B alignment).  We can do this in
reverse-post order starting with the loads:

  ; %a = load <3 x 2 x float>, <3 x 2 x float>* %A
  %a0 = load <4 x float>, <4 x float>* %A0, align 16
  %a1 = load <4 x float>, <4 x float>* %A1, align 16

  ;; ... similarly for b and c ...

  ; %add = fadd <3 x 2 x float> %a, %b
  %add0 = fadd <4 x float> %a0, %b0
  %add1 = fadd <4 x float> %a1, %b1

  ; %mul = call <3 x 4 x float> @llvm.matrix.multiply(<3 x 2 x
float> %add,
  ;                                                   <2 x 4 x float> %c)
  ...

We obviously want to arrive to the same lowering for the other options too.  The
original IR for option B contains flattened matrices and intrinsics for all
matrix operations:

  ;         3 x 2 pad(1)
  %a = load <8 x float>, <8 x float>* %A
  ;         3 x 2 pad(1)
  %b = load <8 x float>, <8 x float>* %B
  ;         2 x 4 pad(0)
  %c = load <8 x float>, <8 x float>* %C

  ;          3 x 2 pad(1)
  %add = call <8 x float> @llvm.matrix.fadd(<8 x float> %a, <8 x
float> %b,
                                            ;   3 x 2
                                            i32 3, i32 2)

  ;          3 x 4 pad(1)
  %mul = call <16 x float> @llvm.matrix.multiply(<8 x float> %add,
<8 x float> %c,
                                                 ;   3 x 2             2 x 4
                                                 i32 3, i32 2,     i32 2, i32 4)

Note that we only have shape and layout information on computations.  We don’t
have them on other instructions like: load, store, phi, select, bitcast, memcpy
intrinsic etc.  Since the shape and layout information is critical to avoid
unnecessary shuffles when working on rows/columns we need to recover this by
propagating this information to all matrix operations.

This is a forward, backward data-flow problem.  It is simplified because we only
have two states: “unspecified" or "specified shape and layout
information".  The IR is malformed if operands don’t have consistent shape
or layout (e.g. for addition they have to match; for matmul the inner dimensions
need to match).  Since we only revisit uses or defs as they transition from
unspecified to specified we can only visit each instruction twice at most.

Option C is also has flattened matrices but shape and layout information
(padding) is spread across computations and shufflevectors:

  ;               3 x 2 pad(1)
  %a.padded = load <8 x float>, <8 x float>* %A
  ;               3 x 2 pad(1)
  %b.padded = load <8 x float>, <8 x float>* %B
  ;        2 x 4 pad(0)
  %c = load <8 x float>, <8 x float>* %C

  %a = shufflevector <8 x float> %a.padded, <6 x i32> <i32 0, i32
1, i32 2, i32 4, i32 5, i32 6>
  %b = shufflevector <8 x float> %b.padded, <6 x i32> <i32 0, i32
1, i32 2, i32 4, i32 5, i32 6>
  ;          3 x 2
  %add = fadd <6 x float> %a, %b

  ;           3 x 4
  %mul = call <12 x float> @llvm.matrix.multiply(<6 x float> %add,
<8 x float> %c,
                                               ;    3 x 2             2 x 4
						i32 3, i32 2,     i32 2, i32 4)

Since the information is more spread around, the data-flow analysis becomes more
complicated as we need to track shape and layout separately.  Additionally,
since optimizations can transform the shufflevectors, we may require more
granular tracking of the layout which would make the analysis even more complex.

To summarize we have data flow problems to solve for option B and C in order to
recover information that is directly represented in the IR for option A. 
Solving the data flow problem is also expected to be harder for option C than B.

Fusion opportunistically lowers larger subgraphs rather than individual
operations .  It needs the same shape information.  If we fail to successfully
recover this information for the entire subgraph, fusion won’t take place.  It’s
hard to foresee the potential problems here but I expect that the complexity and
the failure rate of fusion would increase from option A to B and then to C.

------------------------- 
Inlining to enable more fusion
------------------------- 

The goal is to make the inliner aware of fusion opportunities in order to build
larger graphs.  Additionally for option C we will have to tweak the
profitability estimation logic to recognize and exclude the shufflevector
instructions.

------------------------- 
Matrix HW, matrix library
------------------------- 

This can potentially pose very diverse requirements but essentially instead of
decomposing matrix operations into efficient vector operations, the goal is to
decompose them into matrix operation of a certain natively-supported shape. As
outlined in the lowering section, since we can already establish the shape and
layout information at each matrix operations, we should be able to decompose
them into small matrix operations rather than going directly to vectors.  Thus,
this does not impose any new requirements, at least not in general.  I am sure
specific ports may have some interesting requirements in which case we may have
to adapt the design.

------------------------- 
Proper padding for small matrices
------------------------- 

We want the individual rows/columns (for row/column-major) to be naturally
aligned by no more than the max vector alignment.  E.g. for a 3 x 3 x float with
column major order, we should have a single element padding after each column
which is a total of 48 bytes.  For option A, since it’s a new type we can just
define this in the new ABI.

For option B and C, on the other hand, vector alignment and padding is already
mandated by the VectorType.  This is part of the ABI when people use vector_size
or ext_vector_type attributes on the clang side.

Alignment and allocation size (including padding) is the original size of vector
rounded up to the next power-of-2.  So for example a 3 x 3 x float pad(1) or 12
x float is rounded up to 64 bytes.  This is excessive.  I also need to support
unpadded matrices that are effectively ABI-compatible with arrays.

Front-ends could lower to arrays instead of vectors and then cast them to
vectors specifying the proper alignment.  This would complicate front-ends and
the IR.  A more reasonable solution would be allow reducing the alignment and in
turn the padding of the vector type itself.  We could have an align attribute on
the type itself:

For example <12 x float align(16)> for naturally aligned columns or <9
x float align(1)> for the unpadded case.

In summary, option B and C can be made to work with some IR extensions.

------------------------- 
Pass and return small matrices in vector registers
------------------------- 

Typically (e.g. AArch64, x86_64) non-power-of-2-sized vectors are passed and
returned in memory.  We want small matrices like a 3 x 3 x float to be passed
and returned in three vector registers assuming a vector width of 128 bits.

For option A, we can directly specify this in the new ABI.  For option B and C
however we need a new parameter attribute to enforce this.  For example,
‘split_vec(N)’ below would specify that the vector argument is passed split up
in chunks of N x <elt_type> in argument register where <elt_type> is
the same as the element type of the vector argument.

For example, this is a function taking two and returning a 3x3 matrix (with
option B syntax):

define <12 x float> split_vec(4) @f(<12 x float> split_vec(4) %a,
<12 x float> split_vec(4) %b) {
  %c = call <12 x float> @llvm.matrix.fadd(<12 x float> %a, <12 x
float> %b, i32 3, i32 3)
  ret <12 x float> %c
}

Let’s see how this is all lowered.  First, the IR lowering pass splits up the
flattened matrices in the body, arguments remain un-lowered:

define <12 x float> split_vec(4) @f(<12 x float> split_vec(4) %a,
<12 x float> split_vec(4) %b) {
  ; Extract columns
  %a0 = shufflevector <12 x float> %a, <12 x float> undef,
       <4 x i32> <i32 0, i32 1, i32 2, i32 undef>
  %a1 = shufflevector <12 x float> %a, <12 x float> undef,
       <4 x i32> <i32 4, i32 5, i32 6, i32 undef>
  %a2 = shufflevector <12 x float> %a, <12 x float> undef,
       <4 x i32> <i32 8, i32 9, i32 10, i32 undef>

  ; same for b
  ; ...

  %c0 = fadd <4 x float> %a0, %b0
  %c1 = fadd <4 x float> %a1, %b1
  %c2 = fadd <4 x float> %a2, %b2

  ; Insert columns
  %c01 = shufflevector <4 x float> %c0,  <4 x float> %c1,
       <8 x i32> <i32 0, i32 1, i32 2, i32 undef,
                  i32 4, i32 5, i32 6, i32 undef >
  %c2.w = shufflevector <4 x float> %c2, <4 x float> undef,
       <8 x i32> <i32 0, i32 1, i32 2, i32 undef,
                  i32 undef, i32 undef, i32 undef, i32 undef >
  %c = shufflevector <8 x float> %c01,  <8 x float> %c2.w,
       <12 x i32> <i32 0, i32 1, i32 2, i32 undef,
                   i32 4, i32 5, i32 6, i32 undef,
                   i32 8, i32 9, i32 10, i32 undef >

  ret <12 x float> %c
}

Then during SDAG building, the arguments are split to map to ABI argument
registers and a CONCAT_VECTOR is used to recreate the original un-split value
(SDAG code in comments):


define <12 x float> split_vec(4) @f(<12 x float> split_vec(4) %a,
<12 x float> split_vec(4) %b) {
;                                      Register:v4f32 %0, %1, %2     
Register:v4f32 %3, %4, %5

;   t0: ch = EntryToken
;    t4:  v4f32,ch = CopyFromReg t0, Register:v4f32 %1
;    t6:  v4f32,ch = CopyFromReg t0, Register:v4f32 %2
;    t8:  v4f32,ch = CopyFromReg t0, Register:v4f32 %3
;    t10: v4f32,ch = CopyFromReg t0, Register:v4f32 %4
;    t12: v4f32,ch = CopyFromReg t0, Register:v4f32 %5
;    t14: v4f32,ch = CopyFromReg t0, Register:v4f32 %6

;    t15: v12f32 = CONCAT_VECTORS t4, t6, t8
;    t16: v12f32 = CONCAT_VECTORS t10, t12, t14
;------------

  %a0 = shufflevector <12 x float> %a, <12 x float> undef,
       <4 x i32> <i32 0, i32 1, i32 2, i32 undef>
  %a1 = shufflevector <12 x float> %a, <12 x float> undef,
       <4 x i32> <i32 4, i32 5, i32 6, i32 undef>
  %a2 = shufflevector <12 x float> %a, <12 x float> undef,
       <4 x i32> <i32 8, i32 9, i32 10, i32 undef>

;    t17: v4f32 = EXTRACT_SUBVECTOR t15, 0
;    t18: v4f32 = EXTRACT_SUBVECTOR t15, 4
;    t19: v4f32 = EXTRACT_SUBVECTOR t15, 8

The CONCAT_VECTORs will be cancelled out by the EXTRACT_SUBVECTOR generated for
the first set of shufflevector.  So the argument registers will be forwarded
directly to the adds.

Similarly on the return path, the CONCAT_VECTOR generated for the second set of
shufflevectors will be cancelled out by the EXTRACT_SUBVECTOR that splits the
vector across the registers for the return value:

  %c01 = shufflevector <4 x float> %c0,  <4 x float> %c1,
       <8 x i32> <i32 0, i32 1, i32 2, i32 undef,
                  i32 4, i32 5, i32 6, i32 undef >
  %c2.w = shufflevector <4 x float> %c2, <4 x float> undef,
       <8 x i32> <i32 0, i32 1, i32 2, i32 undef,
                  i32 undef, i32 undef, i32 undef, i32 undef >
  %c = shufflevector <8 x float> %c01,  <8 x float> %c2.w,
       <12 x i32> <i32 0, i32 1, i32 2, i32 undef,
                   i32 4, i32 5, i32 6, i32 undef,
                   i32 8, i32 9, i32 10, i32 undef >

;   t26: v12f32 = CONCAT_VECTORS t23, t24, t25

  ret <12 x float> %c

;-----------
;   t27: v4f32 = EXTRACT_SUBVECTOR t26, 0
;   t28: v4f32 = EXTRACT_SUBVECTOR t26, 4
;   t29: v4f32 = EXTRACT_SUBVECTOR t26, 8

;   t42: ch,glue = CopyToReg t0,  Register:v4f32 $q0, t27
;   t44: ch,glue = CopyToReg t42, Register:v4f32 $q1, t28, t42:1
;   t46: ch,glue = CopyToReg t44, Register:v4f32 $q2, t29, t44:1

;   t49: ch = AArch64ISD::RET_FLAG t48, Register:v4f32 $q0, Register:v4f32 $q1,
Register:v4f32 $q2, Register:v4f32 $q3, t48:1
}
  
==========================Summary
==========================
Based on the feedback and the evaluation, option B seems like a natural first
step.  It needs some IR extensions (split_vec(N) and align(N) on the vector
type) but it’s still the fastest way to something working so that we can
experiment if in fact a full-fledge multi-dimensional vector support is
necessary in the IR.  The only downside it has that it can’t leverage the
existing optimizations for element-wise operations.

Option C could address that but may pose more challenges in recovering the
original matrix layout information.  The good news is that this can be explored
incrementally from a working option B.

Thanks,
Adam

[1]  http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html
<http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20181205/0bbbcf52/attachment-0001.html>

John McCall via llvm-dev

2018-Dec-06 04:46 UTC

head link

[llvm-dev] [RFC] Matrix support (take 2)

On 5 Dec 2018, at 13:41, Adam Nemet wrote:> -------------------------
> Proper padding for small matrices
> -------------------------
>
> We want the individual rows/columns (for row/column-major) to be 
> naturally aligned by no more than the max vector alignment.  E.g. for 
> a 3 x 3 x float with column major order, we should have a single 
> element padding after each column which is a total of 48 bytes.  For 
> option A, since it’s a new type we can just define this in the new 
> ABI.
>
> For option B and C, on the other hand, vector alignment and padding is 
> already mandated by the VectorType.  This is part of the ABI when 
> people use vector_size or ext_vector_type attributes on the clang 
> side.
>
> Alignment and allocation size (including padding) is the original size 
> of vector rounded up to the next power-of-2.  So for example a 3 x 3 x 
> float pad(1) or 12 x float is rounded up to 64 bytes.  This is 
> excessive.  I also need to support unpadded matrices that are 
> effectively ABI-compatible with arrays.
>
> Front-ends could lower to arrays instead of vectors and then cast them 
> to vectors specifying the proper alignment.  This would complicate 
> front-ends and the IR.  A more reasonable solution would be allow 
> reducing the alignment and in turn the padding of the vector type 
> itself.  We could have an align attribute on the type itself:
>
> For example <12 x float align(16)> for naturally aligned columns or
<9
> x float align(1)> for the unpadded case.
>
> In summary, option B and C can be made to work with some IR 
> extensions.
In general, IR type alignment is all but meaningless.  A well-behaved 
frontend should be setting an explicit alignment on every declaration, 
definition, allocation, and access it generates (if it's ABI, at least), 
and it must always lower aggregate types in a way that understands that 
IR type alignment can arbitrarily differ from the actual alignment of 
the frontend's original type.  So while I'm certainly supportive of 
allowing vector types to carry additional information besides just an 
element type and count, I'm a little skeptical specifically about the 
value of it carrying an alignment.

John.

Adam Nemet via llvm-dev

2018-Dec-06 06:24 UTC

head link

[llvm-dev] [RFC] Matrix support (take 2)

> On Dec 5, 2018, at 8:46 PM, John McCall <jmccall at apple.com> wrote:
> 
> On 5 Dec 2018, at 13:41, Adam Nemet wrote:
>> -------------------------
>> Proper padding for small matrices
>> -------------------------
>> 
>> We want the individual rows/columns (for row/column-major) to be
naturally aligned by no more than the max vector alignment.  E.g. for a 3 x 3 x
float with column major order, we should have a single element padding after
each column which is a total of 48 bytes.  For option A, since it’s a new type
we can just define this in the new ABI.
>> 
>> For option B and C, on the other hand, vector alignment and padding is
already mandated by the VectorType.  This is part of the ABI when people use
vector_size or ext_vector_type attributes on the clang side.
>> 
>> Alignment and allocation size (including padding) is the original size
of vector rounded up to the next power-of-2.  So for example a 3 x 3 x float
pad(1) or 12 x float is rounded up to 64 bytes.  This is excessive.  I also need
to support unpadded matrices that are effectively ABI-compatible with arrays.
>> 
>> Front-ends could lower to arrays instead of vectors and then cast them
to vectors specifying the proper alignment.  This would complicate front-ends
and the IR.  A more reasonable solution would be allow reducing the alignment
and in turn the padding of the vector type itself.  We could have an align
attribute on the type itself:
>> 
>> For example <12 x float align(16)> for naturally aligned columns
or <9 x float align(1)> for the unpadded case.
>> 
>> In summary, option B and C can be made to work with some IR extensions.
> 
> In general, IR type alignment is all but meaningless.  
Sure, that’s clear.  I may not have been precise above that I actually tried to
lower the alignment in other ways but failed.
> A well-behaved frontend should be setting an explicit alignment on every
declaration, definition, allocation, and access it generates (if it's ABI,
at least), and it must always lower aggregate types in a way that understands
that IR type alignment can arbitrarily differ from the actual alignment of the
frontend's original type.
I wasn’t able to lower the alignment (and alloc size) for alloca through the
align attribute.  I assumed that it was meant to only allow increasing
alignment.

I am also unclear how lowering the alignment should work for a vector data
member of a structure type.  Would I have to make the structure packed and
manually fill in the padding?  Is that desirable every time a flattened matrix
is used inside a structure?

Thanks,
Adam
>  So while I'm certainly supportive of allowing vector types to carry
additional information besides just an element type and count, I'm a little
skeptical specifically about the value of it carrying an alignment.
> 
> John.

Chris Lattner via llvm-dev

2018-Dec-19 04:45 UTC

head link

[llvm-dev] [RFC] Matrix support (take 2)

> On Dec 5, 2018, at 10:41 AM, Adam Nemet <anemet at apple.com> wrote:
> Hi all,
> 
> After the previous RFC[1], there were multiple discussions on the ML and in
person at the DevMtg.  I will summarize the options discussed and propose a path
forward.
> 
> ==========================> Options
> ==========================> 
> A. Extend VectorType to be multidimensional
> 
> B. Flatten matrices into the current VectorType.  Matrix shape and layout
information is passed to matrix intrinsics.  All matrix operations including
element-wise matrix operations are implemented via intrinsics
> 
> C. Same as B but padding is explicitly managed by shufflevector
instructions, element-wise operations are implemented via built-in operators
(e.g. fadd)
> 
> ==========================> tl;dr
> ==========================> 
> There was some support for option A to introduce first-class matrices (or
multidimensional vectors) but also many concerns.  I have sketched out many
examples in IR and flattening matrices to vectors does not seem to present any
clear show-stoppers.  Thus, I am scaling back the proposal to option B which is
a more incremental step.
Seems reasonable to start small, learn, then build out from there.
> Throughout this work, an important goal is to provide a matrix-aware
IRBuilder API.  E.g.:
> 
>   Value *CreateMatrixAdd(Value *Op0, Value Op1,
>                          unsigned Rows, unsigned Cols,
>                          MatrixLayout ML /* row/column-major, padding */);
> 
> This will allow for simpler front-ends and would allow us to swap out the
generated IR if the design needs to change.
Why does this have to be on IRBuilder?  If you were writing this in Swift, then
these would be an extension on IRBuilder that added some methods like this, but
could be kept out of core.

C++ isn’t as sophisticated, but you could still define these as a separate
header file with functions like:

	Value *BuilderCreateMatrixAdd(IRBuilder &B, … other args)

This would avoid adding a bunch of matrix specific stuff to the core IRBuilder
class and header.

> 
> ==========================> Details 
> ==========================> 
> -------------------------
> Introduction
> -------------------------
> 
> Representing multi-dimensional vectors in the IR through types makes the IR
more expressive (option A).  Additionally, if we have a new type we have the
freedom to implicitly map it to a layout.  E.g. <3 x 3 x float> could
imply column-major order and one element of padding between each column.  When
it’s passed or returned from functions it should be passed in 3 vector
registers.
> 
> This is a sample IR to add two 3x3 matrices followed by a matrix multiply
with a 3x2:
> 
>   %a = load <3 x 3 x float>, <3 x 3 x float>* %A
>   %b = load <3 x 3 x float>, <3 x 3 x float>* %B
>   %c = load <3 x 2 x float>, <3 x 2 x float>* %C
> 
>   %add = fadd <3 x 3 x float> %a, %b
>   %mul = call <3 x 2 x float> @llvm.matrix.multiply(<3 x 3 x
float> %add,
>                                                     <3 x 2 x float>
%c)
>   store	<3 x 2 x float>	%mul, <3 x 2 x float>* %MUL
LLVM requires name mangling the type into the intrinsic, but yeah.
> 
> Note that the type always implies a layout here.  If we have multiple
layouts appear in the same module beyond the default specified by DataLayout, we
would have these represented in the type, e.g. <3 x 3 x float column-major
pad(1)> specifying column-major layout with a single element of padding after
each 3-element column vector.
Right.
> Also note that we’re using built-in fadd operator for element-wise
operation and an intrinsic for non-element-wise operations like matrix multiply.
> 
> Instead of extending the type system, we can map the matrix instances onto
existing types.  The vector type is a natural fit as it can be considered a
SequenceType with Row*Column elements of the element type.  For operations, like
matrix multiply we just need to pass the shape information for the extra
dimension.
> 
> One question arises though, how padding should be handled.  For one,
performing operations like division with the padding can cause spurious faults. 
But even for non-trapping operations excluding padding should be an option.  For
example in the case of a <3 x 3 x double>, we may want to lower a single
row/column into the combination of a 128B vector register (2 elts) and a scalar
rather than two vectors.  This should be more beneficial for power.  Thus we
want to make padding explicit in the IR.
> 
> One option is to expose the shape to all operations including element-wise
operations.  This is option B.  With that, the above sequence looks like this:
> 
>   %a = load <12 x float>, <12 x float>* %A, align 16
>   %b = load <12 x float>, <12 x float>* %B, align 16
>   %c = load <8 x float>, <8 x float>* %C, align 16
> 
>   %add = call <12 x float> @llvm.matrix.fadd(<12 x float> %a,
<12 x float> %b,
>       	      	      	      	      	    ;    3 x 3  column-major:
>                                              i32 3, i32 3,    i1 true)
> 
>   %mul = call <8 x float> @llvm.matrix.multiply(<12 x float>
%add, <8 x float> %c,
> 					       ;    3 x 3             3 x 2  column-major:
>                                                 i32 3, i32 3,     i32 3,
i32 2,     i1 true)
>   store <8 x float> %mul, <8 x float>* %MUL, align 16
>  
> Each computation takes full shape information.  The matrix shape is
described with the row and columns dimensions and are passed to the intrinsic as
constant parameters.  We can also pass layout information like whether the
matrices are laid out in row-major or column-major order.  We’re using column
major order in the example and as such the 3 x 3 x float matrix is flattened
into a 12 x float vector with one element padding at the end of each column.
I don’t understand this.  What is the benefit of providing layout info to
element wise operations?  This defeats the goal of having simple lowering and
representation: you are encoding an ND vector form into the IR in a really ugly
way, and this will cause a proliferation of intrinsics that are redundant with
the core ops.

Also, are 2d matrices really as general as we want to go here?  Generally you go
from 1 to 2 to N, and it seems lik you are proposing going from 1 (scalar) to 2
(vectors) to 3 (2d arrays) without giving N.  If you want to provide layout info
for general ND arrays, a single bit is not going to be enough nor is your
row/col size representation.
> The amount of padding does not require any new parameters.  We can compute
it using the shape information and the size of the flattened matrix (e.g. %c
which is a <3 x 2 x float> also has one element of padding: Elts / Columns
- Rows = 8 / 2 - 3 = 1).
> 
> In order to expose the padding elements to element-wise operation (fadd),
option B maps those to intrinsics.  We can expose the padding bytes in other
ways such that we can still use the built-in element-wise operators.  One way
would be to extend the vector types with specifying the padding, something like
<12 x float pad(3, 7, 11)> (John McCall’s idea) or removing the padding
with explicit shufflevectors (Chandler’s idea).  I explored the lattter under
option C.  With option C, the same sequence of operations look like this:
> 
>   %a.padded = load <12 x float>, <12 x float>* %A, align 16
>   ; remove padding
>   %a = shufflevector <12 x float> %a.padded, <12 x float>
undef,
>                      <9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5,
i32 6, i32 8, i32 9, i32 10>
> 
>   %b.padded = load <12 x float>, <12 x float>* %B, align 16
>   %b = shufflevector <12 x float> %b.padded, <12 x float>
undef,
>                      <9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5,
i32 6, i32 8, i32 9, i32 10>
> 
>   %c.padded = load <8 x float>, <8 x float>* %C, align 16
>   %c = shufflevector <8 x float> %c.padded, <8 x float> undef,
>                      <6 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5,
i32 6>
> 
> 
>   %add = fadd <9 x float> %a, %b
I don’t understand why you’re trying to avoid adding padding.  If you are
worried about snans, then it seems that you could arrange for the producers of
padding to have some guaranteed properties instead of being undef.
> 
> ------------------------- 
> Matrix Operation Lowering and Fusion
> ------------------------- 
> 
> Common to all of these options is that we are proposing a new IR pass that
pre-legalizes matrix operations by lowering them to operations that are natively
supported by the HW.  This means decomposing the operations into native SIMD
operations.
> 
> This pass will be used to also de-interleave a chain of matrix operations
to manage register pressure.
Cool.  While I’m not really thrilled with yet another “codegenprepare” style
pass, I agree it is the most pragmatic given the lack of pervasive global isel
etc.

Just an observation: this pass can be dropped in today and would be useful for
large vectors, independent of your matrix work.

> 
> Note that we only have shape and layout information on computations.  We
don’t have them on other instructions like: load, store, phi, select, bitcast,
memcpy intrinsic etc.  Since the shape and layout information is critical to
avoid unnecessary shuffles when working on rows/columns we need to recover this
by propagating this information to all matrix operations.
My sense is that this info is important for your lowering, and your approach of
using dataflow analysis to recover this will fail in some cases.

Since layout and padding information is important, it seems most logical to put
this into the type.  Doing so would make it available in all these places.

That said, I still don’t really understand why you *need* it.

-Chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20181218/8e76ff00/attachment-0001.html>

Adam Nemet via llvm-dev

2018-Dec-19 06:18 UTC

head link

[llvm-dev] [RFC] Matrix support (take 2)

Hi Chris,
> On Dec 18, 2018, at 8:45 PM, Chris Lattner <clattner at nondot.org>
wrote:
> 
> 
>> On Dec 5, 2018, at 10:41 AM, Adam Nemet <anemet at apple.com>
wrote:
>> Hi all,
>> 
>> After the previous RFC[1], there were multiple discussions on the ML
and in person at the DevMtg.  I will summarize the options discussed and propose
a path forward.
>> 
>> ==========================>> Options
>> ==========================>> 
>> A. Extend VectorType to be multidimensional
>> 
>> B. Flatten matrices into the current VectorType.  Matrix shape and
layout information is passed to matrix intrinsics.  All matrix operations
including element-wise matrix operations are implemented via intrinsics
>> 
>> C. Same as B but padding is explicitly managed by shufflevector
instructions, element-wise operations are implemented via built-in operators
(e.g. fadd)
>> 
>> ==========================>> tl;dr
>> ==========================>> 
>> There was some support for option A to introduce first-class matrices
(or multidimensional vectors) but also many concerns.  I have sketched out many
examples in IR and flattening matrices to vectors does not seem to present any
clear show-stoppers.  Thus, I am scaling back the proposal to option B which is
a more incremental step.
> 
> Seems reasonable to start small, learn, then build out from there.
> 
>> Throughout this work, an important goal is to provide a matrix-aware
IRBuilder API.  E.g.:
>> 
>>   Value *CreateMatrixAdd(Value *Op0, Value Op1,
>>                          unsigned Rows, unsigned Cols,
>>                          MatrixLayout ML /* row/column-major, padding
*/);
>> 
>> This will allow for simpler front-ends and would allow us to swap out
the generated IR if the design needs to change.
> 
> Why does this have to be on IRBuilder?  If you were writing this in Swift,
then these would be an extension on IRBuilder that added some methods like this,
but could be kept out of core.
> 
> C++ isn’t as sophisticated, but you could still define these as a separate
header file with functions like:
> 
> 	Value *BuilderCreateMatrixAdd(IRBuilder &B, … other args)
> 
> This would avoid adding a bunch of matrix specific stuff to the core
IRBuilder class and header.
Sure, that works too.
> 
> 
>> 
>> ==========================>> Details 
>> ==========================>> 
>> -------------------------
>> Introduction
>> -------------------------
>> 
>> Representing multi-dimensional vectors in the IR through types makes
the IR more expressive (option A).  Additionally, if we have a new type we have
the freedom to implicitly map it to a layout.  E.g. <3 x 3 x float> could
imply column-major order and one element of padding between each column.  When
it’s passed or returned from functions it should be passed in 3 vector
registers.
>> 
>> This is a sample IR to add two 3x3 matrices followed by a matrix
multiply with a 3x2:
>> 
>>   %a = load <3 x 3 x float>, <3 x 3 x float>* %A
>>   %b = load <3 x 3 x float>, <3 x 3 x float>* %B
>>   %c = load <3 x 2 x float>, <3 x 2 x float>* %C
>> 
>>   %add = fadd <3 x 3 x float> %a, %b
>>   %mul = call <3 x 2 x float> @llvm.matrix.multiply(<3 x 3 x
float> %add,
>>                                                     <3 x 2 x
float> %c)
>>   store	<3 x 2 x float>	%mul, <3 x 2 x float>* %MUL
> 
> LLVM requires name mangling the type into the intrinsic, but yeah.
Yes, we have that in our prototype.  I just removed it from here for
readability.
> 
>> 
>> Note that the type always implies a layout here.  If we have multiple
layouts appear in the same module beyond the default specified by DataLayout, we
would have these represented in the type, e.g. <3 x 3 x float column-major
pad(1)> specifying column-major layout with a single element of padding after
each 3-element column vector.
> 
> Right.
> 
>> Also note that we’re using built-in fadd operator for element-wise
operation and an intrinsic for non-element-wise operations like matrix multiply.
>> 
>> Instead of extending the type system, we can map the matrix instances
onto existing types.  The vector type is a natural fit as it can be considered a
SequenceType with Row*Column elements of the element type.  For operations, like
matrix multiply we just need to pass the shape information for the extra
dimension.
>> 
>> One question arises though, how padding should be handled.  For one,
performing operations like division with the padding can cause spurious faults. 
But even for non-trapping operations excluding padding should be an option.  For
example in the case of a <3 x 3 x double>, we may want to lower a single
row/column into the combination of a 128B vector register (2 elts) and a scalar
rather than two vectors.  This should be more beneficial for power.  Thus we
want to make padding explicit in the IR.
>> 
>> One option is to expose the shape to all operations including
element-wise operations.  This is option B.  With that, the above sequence looks
like this:
>> 
>>   %a = load <12 x float>, <12 x float>* %A, align 16
>>   %b = load <12 x float>, <12 x float>* %B, align 16
>>   %c = load <8 x float>, <8 x float>* %C, align 16
>> 
>>   %add = call <12 x float> @llvm.matrix.fadd(<12 x float>
%a, <12 x float> %b,
>>       	      	      	      	      	    ;    3 x 3  column-major:
>>                                              i32 3, i32 3,    i1 true)
>> 
>>   %mul = call <8 x float> @llvm.matrix.multiply(<12 x
float> %add, <8 x float> %c,
>> 					       ;    3 x 3             3 x 2  column-major:
>>                                                 i32 3, i32 3,     i32
3, i32 2,     i1 true)
>>   store <8 x float> %mul, <8 x float>* %MUL, align 16
>>  
>> Each computation takes full shape information.  The matrix shape is
described with the row and columns dimensions and are passed to the intrinsic as
constant parameters.  We can also pass layout information like whether the
matrices are laid out in row-major or column-major order.  We’re using column
major order in the example and as such the 3 x 3 x float matrix is flattened
into a 12 x float vector with one element padding at the end of each column.
> 
> I don’t understand this.  What is the benefit of providing layout info to
element wise operations?  This defeats the goal of having simple lowering and
representation: you are encoding an ND vector form into the IR in a really ugly
way, and this will cause a proliferation of intrinsics that are redundant with
the core ops.
The reason we need that information so that for example we can lower an
operation on a 3-element column into a vector of 2 and a scalar op.  This should
be beneficial for power consumption since for example in the case of a 3x3 with
a single element padding rather than operating on 12 elements you’d operate only
on 9 (vector ops consume more power than their scalar counterparts).

That said we should be able to remove these intrinsics in the long term.  Once
we have masking on the core ops in the IR, we should be able to express the same
semantics without dedicated intrinsics.
> 
> Also, are 2d matrices really as general as we want to go here?  Generally
you go from 1 to 2 to N, and it seems lik you are proposing going from 1
(scalar) to 2 (vectors) to 3 (2d arrays) without giving N.  If you want to
provide layout info for general ND arrays, a single bit is not going to be
enough nor is your row/col size representation.
Yes, we could start with an ND-ready interface too.  I was going to start with
just matrix because that is my immediate need and then generalize from there but
I guess we can start with the more generalized intrinsic even if we first only
focus on the 2D implementation?
> 
>> The amount of padding does not require any new parameters.  We can
compute it using the shape information and the size of the flattened matrix
(e.g. %c which is a <3 x 2 x float> also has one element of padding: Elts
/ Columns - Rows = 8 / 2 - 3 = 1).
>> 
>> In order to expose the padding elements to element-wise operation
(fadd), option B maps those to intrinsics.  We can expose the padding bytes in
other ways such that we can still use the built-in element-wise operators.  One
way would be to extend the vector types with specifying the padding, something
like <12 x float pad(3, 7, 11)> (John McCall’s idea) or removing the
padding with explicit shufflevectors (Chandler’s idea).  I explored the lattter
under option C.  With option C, the same sequence of operations look like this:
>> 
>>   %a.padded = load <12 x float>, <12 x float>* %A, align 16
>>   ; remove padding
>>   %a = shufflevector <12 x float> %a.padded, <12 x float>
undef,
>>                      <9 x i32> <i32 0, i32 1, i32 2, i32 4,
i32 5, i32 6, i32 8, i32 9, i32 10>
>> 
>>   %b.padded = load <12 x float>, <12 x float>* %B, align 16
>>   %b = shufflevector <12 x float> %b.padded, <12 x float>
undef,
>>                      <9 x i32> <i32 0, i32 1, i32 2, i32 4,
i32 5, i32 6, i32 8, i32 9, i32 10>
>> 
>>   %c.padded = load <8 x float>, <8 x float>* %C, align 16
>>   %c = shufflevector <8 x float> %c.padded, <8 x float>
undef,
>>                      <6 x i32> <i32 0, i32 1, i32 2, i32 4,
i32 5, i32 6>
>> 
>> 
>>   %add = fadd <9 x float> %a, %b
> 
> I don’t understand why you’re trying to avoid adding padding.  If you are
worried about snans, then it seems that you could arrange for the producers of
padding to have some guaranteed properties instead of being undef.
It’s more the extra work/power consumed by the padding elements that concerns
me.
> 
>> 
>> ------------------------- 
>> Matrix Operation Lowering and Fusion
>> ------------------------- 
>> 
>> Common to all of these options is that we are proposing a new IR pass
that pre-legalizes matrix operations by lowering them to operations that are
natively supported by the HW.  This means decomposing the operations into native
SIMD operations.
>> 
>> This pass will be used to also de-interleave a chain of matrix
operations to manage register pressure.
> 
> Cool.  While I’m not really thrilled with yet another “codegenprepare”
style pass, I agree it is the most pragmatic given the lack of pervasive global
isel etc.
> 
> Just an observation: this pass can be dropped in today and would be useful
for large vectors, independent of your matrix work.
> 
> 
>> 
>> Note that we only have shape and layout information on computations. 
We don’t have them on other instructions like: load, store, phi, select,
bitcast, memcpy intrinsic etc.  Since the shape and layout information is
critical to avoid unnecessary shuffles when working on rows/columns we need to
recover this by propagating this information to all matrix operations.
> 
> My sense is that this info is important for your lowering, and your
approach of using dataflow analysis to recover this will fail in some cases.
> 
> Since layout and padding information is important, it seems most logical to
put this into the type.  Doing so would make it available in all these places.
> 
> That said, I still don’t really understand why you *need* it.
This seems like the main sticking point so let’s close on this first and see if
my answers above are satisfying.

Thanks for taking a look!

Adam
> 
> -Chris
> -------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20181218/316c3ecf/attachment.html>

lkcl via llvm-dev

2019-Feb-08 03:57 UTC

head link

[llvm-dev] [RFC] Matrix support (take 2)

On Tue Dec 18 20:45:12 PST 2018, Chris wrote:
> Since layout and padding information is important, it seems most
> logical to put this into the type.  Doing so would make it available
> in all these places.
> That said, I still don’t really understand why you *need* it.
for large vectors and matrices that simply will not fit into the register
file, LD/ST and MV etc. in the form of gather/scatter or vectorised MVX [1]
is the clear and obvious requirement.

however the penalty for use of LD/ST is the power consumption hit of
going through the L1/L2 cache barrier.

for a low-power cost-competitive 3D GPU, for example, a 100% increase in
power consumption due to the penalty of being forced to move data back
and forth multiple times through the L1/L2 cache would be completely
unacceptable.

hence the natural solution, for small vectors and matrices, to be able
to process them *in-place*.

that in turn means having, at the *architectural* level, a way to re-order
the sequence of an otherwise straight linear 1D array of elements.  with
the right re-ordering capability, it even becomes possible to do arbitrary
in-place transposition of the order of elements, such that matrix multiply
may be done *in-place*, without MV operations.

this practice is extremely common in 3D GPUs, as there tend to be a lot
of 3x4 matrices.  ARM MALI actually added a special hard-coded set of
operations just to deal with 3x4 matrix data.

l.

[1] regfile[regfile[rs]] = regfile[rd]

llvm dev - Dec 2018 - [RFC] Matrix support (take 2)

[llvm-dev] [RFC] Matrix support (take 2)

[llvm-dev] [RFC] Matrix support (take 2)

[llvm-dev] [RFC] Matrix support (take 2)

[llvm-dev] [RFC] Matrix support (take 2)

[llvm-dev] [RFC] Matrix support (take 2)

[llvm-dev] [RFC] Matrix support (take 2)