Hi Chris,
> On Dec 18, 2018, at 8:45 PM, Chris Lattner <clattner at nondot.org> wrote:
>
>
>> On Dec 5, 2018, at 10:41 AM, Adam Nemet <anemet at apple.com> wrote:
>> Hi all,
>>
>> After the previous RFC[1], there were multiple discussions on the ML
>> and in person at the DevMtg. I will summarize the options discussed and
>> propose a path forward.
>>
>> ==========================
>> Options
>> ==========================
>>
>> A. Extend VectorType to be multidimensional.
>>
>> B. Flatten matrices into the current VectorType. Matrix shape and
>> layout information is passed to matrix intrinsics. All matrix operations,
>> including element-wise matrix operations, are implemented via intrinsics.
>>
>> C. Same as B, but padding is explicitly managed by shufflevector
>> instructions, and element-wise operations are implemented via built-in
>> operators (e.g. fadd).
>>
>> ==========================
>> tl;dr
>> ==========================
>>
>> There was some support for option A to introduce first-class matrices
>> (or multidimensional vectors) but also many concerns. I have sketched out
>> many examples in IR, and flattening matrices to vectors does not seem to
>> present any clear show-stoppers. Thus, I am scaling back the proposal to
>> option B, which is a more incremental step.
>
> Seems reasonable to start small, learn, then build out from there.
>
>> Throughout this work, an important goal is to provide a matrix-aware
>> IRBuilder API. E.g.:
>>
>>   Value *CreateMatrixAdd(Value *Op0, Value *Op1,
>>                          unsigned Rows, unsigned Cols,
>>                          MatrixLayout ML /* row/column-major, padding */);
>>
>> This will allow for simpler front-ends and would allow us to swap out
>> the generated IR if the design needs to change.
>
> Why does this have to be on IRBuilder? If you were writing this in
> Swift, then these would be an extension on IRBuilder that added some
> methods like this, but could be kept out of core.
>
> C++ isn’t as sophisticated, but you could still define these as a
> separate header file with functions like:
>
>   Value *BuilderCreateMatrixAdd(IRBuilder &B, … other args)
>
> This would avoid adding a bunch of matrix-specific stuff to the core
> IRBuilder class and header.
Sure, that works too.
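For illustration, a minimal sketch of that free-function style with stand-in types (the real helper would take llvm::IRBuilder<> and llvm::Value; the names and recording mechanism here are hypothetical, just enough to show the pattern):

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Stand-ins for llvm::Value and llvm::IRBuilder<>.
struct Value {
  std::string Name;
};

struct IRBuilder {
  std::vector<std::string> Ops; // records the instructions we "emit"
  std::deque<Value> Results;    // deque: stable addresses for returned values
  Value *CreateFAdd(Value *L, Value *R) {
    Ops.push_back("fadd " + L->Name + ", " + R->Name);
    Results.push_back(Value{"%add"});
    return &Results.back();
  }
};

enum class MatrixLayout { RowMajor, ColumnMajor };

// Lives in a separate MatrixBuilder.h-style header; nothing is added to the
// core IRBuilder class itself.
Value *BuilderCreateMatrixAdd(IRBuilder &B, Value *Op0, Value *Op1,
                              unsigned Rows, unsigned Cols, MatrixLayout ML) {
  // For an element-wise op the shape/layout would only decide whether to
  // emit a shape-carrying intrinsic or a plain fadd; this sketch just
  // forwards to fadd.
  (void)Rows; (void)Cols; (void)ML;
  return B.CreateFAdd(Op0, Op1);
}
```

Front-ends call the free function exactly as they would a member, so the matrix-specific surface stays out of the core header.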
>
>
>>
>> ==========================
>> Details
>> ==========================
>>
>> -------------------------
>> Introduction
>> -------------------------
>>
>> Representing multi-dimensional vectors in the IR through types makes
>> the IR more expressive (option A). Additionally, if we have a new type, we
>> have the freedom to implicitly map it to a layout. E.g. <3 x 3 x float>
>> could imply column-major order and one element of padding between each
>> column. When it’s passed to or returned from functions, it should be
>> passed in 3 vector registers.
>>
>> This is a sample IR to add two 3x3 matrices followed by a matrix
>> multiply with a 3x2:
>>
>>   %a = load <3 x 3 x float>, <3 x 3 x float>* %A
>>   %b = load <3 x 3 x float>, <3 x 3 x float>* %B
>>   %c = load <3 x 2 x float>, <3 x 2 x float>* %C
>>
>>   %add = fadd <3 x 3 x float> %a, %b
>>   %mul = call <3 x 2 x float> @llvm.matrix.multiply(<3 x 3 x float> %add,
>>                                                     <3 x 2 x float> %c)
>>   store <3 x 2 x float> %mul, <3 x 2 x float>* %MUL
>
> LLVM requires name mangling the type into the intrinsic, but yeah.
Yes, we have that in our prototype. I just removed it from here for
readability.
>
>>
>> Note that the type always implies a layout here. If multiple layouts
>> appear in the same module beyond the default specified by DataLayout, we
>> would have these represented in the type, e.g. <3 x 3 x float column-major
>> pad(1)>, specifying column-major layout with a single element of padding
>> after each 3-element column vector.
>
> Right.
>
>> Also note that we’re using the built-in fadd operator for element-wise
>> operations and an intrinsic for non-element-wise operations like matrix
>> multiply.
>>
>> Instead of extending the type system, we can map the matrix instances
>> onto existing types. The vector type is a natural fit as it can be
>> considered a SequenceType with Row*Column elements of the element type.
>> For operations like matrix multiply, we just need to pass the shape
>> information for the extra dimension.
>>
>> One question arises, though: how should padding be handled? For one,
>> performing operations like division on the padding can cause spurious
>> faults. But even for non-trapping operations, excluding the padding
>> should be an option. For example, in the case of a <3 x 3 x double>, we
>> may want to lower a single row/column into the combination of a 128B
>> vector register (2 elts) and a scalar rather than two vectors. This
>> should be more beneficial for power. Thus we want to make padding
>> explicit in the IR.
>>
>> One option is to expose the shape to all operations including
>> element-wise operations. This is option B. With that, the above sequence
>> looks like this:
>>
>>   %a = load <12 x float>, <12 x float>* %A, align 16
>>   %b = load <12 x float>, <12 x float>* %B, align 16
>>   %c = load <8 x float>, <8 x float>* %C, align 16
>>
>>   %add = call <12 x float> @llvm.matrix.fadd(<12 x float> %a, <12 x float> %b,
>>                                              ; 3 x 3, column-major:
>>                                              i32 3, i32 3, i1 true)
>>
>>   %mul = call <8 x float> @llvm.matrix.multiply(<12 x float> %add, <8 x float> %c,
>>                                                 ; 3 x 3, 3 x 2, column-major:
>>                                                 i32 3, i32 3, i32 3, i32 2, i1 true)
>>   store <8 x float> %mul, <8 x float>* %MUL, align 16
>>
>> Each computation takes full shape information. The matrix shape is
>> described by the row and column dimensions, which are passed to the
>> intrinsic as constant parameters. We can also pass layout information
>> like whether the matrices are laid out in row-major or column-major
>> order. We’re using column-major order in the example, so the 3 x 3 x
>> float matrix is flattened into a 12 x float vector with one element of
>> padding at the end of each column.
>
> I don’t understand this. What is the benefit of providing layout info
> to element-wise operations? This defeats the goal of having simple
> lowering and representation: you are encoding an ND vector form into the
> IR in a really ugly way, and this will cause a proliferation of
> intrinsics that are redundant with the core ops.
The reason we need that information is so that, for example, we can lower an
operation on a 3-element column into a 2-wide vector op and a scalar op. This
should be beneficial for power consumption: in the case of a 3x3 with a single
element of padding, rather than operating on 12 elements you’d operate only on
9 (vector ops consume more power than their scalar counterparts).

That said, we should be able to remove these intrinsics in the long term. Once
we have masking on the core ops in the IR, we should be able to express the
same semantics without dedicated intrinsics.
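To make that padded layout concrete (a sketch only, assuming the column-major, one-pad-element-per-column layout described above; the helper name is hypothetical): element (row, col) lands at flat lane col * (Rows + Pad) + row, so a 3x3 occupies lanes 0-2, 4-6 and 8-10 of the <12 x float>, and lanes 3, 7 and 11 are padding.

```cpp
#include <cassert>

// Flat lane of element (Row, Col) in a column-major matrix whose columns
// each carry Pad trailing padding elements (hypothetical helper).
unsigned flatLane(unsigned Row, unsigned Col, unsigned Rows, unsigned Pad) {
  return Col * (Rows + Pad) + Row;
}
```

Only 9 of the 12 lanes carry data, which is why lowering each 3-element column as a 2-wide vector op plus a scalar op touches fewer elements than a padded 4-wide vector op per column.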
>
> Also, are 2D matrices really as general as we want to go here?
> Generally you go from 1 to 2 to N, and it seems like you are proposing
> going from 1 (scalar) to 2 (vectors) to 3 (2D arrays) without giving N.
> If you want to provide layout info for general ND arrays, a single bit is
> not going to be enough, nor is your row/col size representation.
Yes, we could start with an ND-ready interface too. I was going to start with
just matrices because that is my immediate need and then generalize from
there, but I guess we can start with the more generalized intrinsics even if
we first only focus on the 2D implementation?
>
>> The amount of padding does not require any new parameters. We can
>> compute it using the shape information and the size of the flattened
>> matrix (e.g. %c, which is a <3 x 2 x float>, also has one element of
>> padding: Elts / Columns - Rows = 8 / 2 - 3 = 1).
>>
>> In order to expose the padding elements to element-wise operations
>> (fadd), option B maps those to intrinsics. We can expose the padding
>> bytes in other ways such that we can still use the built-in element-wise
>> operators. One way would be to extend the vector types to specify the
>> padding, something like <12 x float pad(3, 7, 11)> (John McCall’s idea),
>> or to remove the padding with explicit shufflevectors (Chandler’s idea).
>> I explored the latter under option C. With option C, the same sequence of
>> operations looks like this:
>>
>>   %a.padded = load <12 x float>, <12 x float>* %A, align 16
>>   ; remove padding
>>   %a = shufflevector <12 x float> %a.padded, <12 x float> undef,
>>        <9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6, i32 8, i32 9, i32 10>
>>
>>   %b.padded = load <12 x float>, <12 x float>* %B, align 16
>>   %b = shufflevector <12 x float> %b.padded, <12 x float> undef,
>>        <9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6, i32 8, i32 9, i32 10>
>>
>>   %c.padded = load <8 x float>, <8 x float>* %C, align 16
>>   %c = shufflevector <8 x float> %c.padded, <8 x float> undef,
>>        <6 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6>
>>
>>   %add = fadd <9 x float> %a, %b
>
> I don’t understand why you’re trying to avoid adding padding. If you
> are worried about snans, then it seems that you could arrange for the
> producers of padding to have some guaranteed properties instead of being
> undef.
It’s more the extra work/power consumed by the padding elements that concerns
me.
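The padding arithmetic and the padding-stripping shuffle masks from the option C example can be sketched as follows (hypothetical helpers mirroring the Elts / Columns - Rows formula and the <9 x i32> mask above):

```cpp
#include <cassert>
#include <vector>

// Padding per column, recovered from the flattened element count and the
// shape: Elts / Columns - Rows (e.g. 8 / 2 - 3 = 1 for the <3 x 2 x float>).
unsigned columnPadding(unsigned Elts, unsigned Rows, unsigned Cols) {
  return Elts / Cols - Rows;
}

// Shuffle mask that drops the padding lanes of a column-major matrix,
// e.g. {0,1,2, 4,5,6, 8,9,10} for the 3 x 3 with one pad element.
std::vector<unsigned> unpadMask(unsigned Rows, unsigned Cols, unsigned Pad) {
  std::vector<unsigned> Mask;
  for (unsigned C = 0; C < Cols; ++C)
    for (unsigned R = 0; R < Rows; ++R)
      Mask.push_back(C * (Rows + Pad) + R);
  return Mask;
}
```

The lowering pass (or front-end) would emit the shufflevector with exactly this mask, so no extra parameters beyond shape are needed to find the padding.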
>
>>
>> -------------------------
>> Matrix Operation Lowering and Fusion
>> -------------------------
>>
>> Common to all of these options is that we are proposing a new IR pass
>> that pre-legalizes matrix operations by lowering them to operations that
>> are natively supported by the HW. This means decomposing the operations
>> into native SIMD operations.
>>
>> This pass will also be used to de-interleave a chain of matrix
>> operations to manage register pressure.
>
> Cool. While I’m not really thrilled with yet another “codegenprepare”
> style pass, I agree it is the most pragmatic given the lack of pervasive
> global isel etc.
>
> Just an observation: this pass can be dropped in today and would be
> useful for large vectors, independent of your matrix work.
>
>
>>
>> Note that we only have shape and layout information on computations.
>> We don’t have it on other instructions like load, store, phi, select,
>> bitcast, the memcpy intrinsic, etc. Since the shape and layout
>> information is critical to avoid unnecessary shuffles when working on
>> rows/columns, we need to recover it by propagating this information to
>> all matrix operations.
>
> My sense is that this info is important for your lowering, and your
> approach of using dataflow analysis to recover this will fail in some
> cases.
>
> Since layout and padding information is important, it seems most
> logical to put this into the type. Doing so would make it available in
> all these places.
>
> That said, I still don’t really understand why you *need* it.
This seems like the main sticking point, so let’s close on this first and see
if my answers above are satisfying.
Thanks for taking a look!
Adam
>
> -Chris
>