thr3ads.net - llvm dev - [llvm-dev] Intel AMX programming model discussion. [Aug 2020]


Luo, Yuanke via llvm-dev

2020-Aug-21 06:12 UTC

[llvm-dev] Intel AMX programming model discussion.

Hi Hal,
The proposal is attractive to me, but there is something I still can't
figure out. Let's take the MIR below as an example. We assume we have 256
register classes (vtile1x1, vtile1x2, ..., vtile16x16).

1.       After instruction selection, the pseudo AMX instructions are generated.
The names of the pseudo instructions have a 'P' prefix. At this point all the
AMX pseudo instructions use vtile as the register class. Let's assume %13 is the
constant 3, %10 is the constant 4, and %14 is a variable.
  %1:vtile = PTILELOADDV %13:gr16, %10:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %2:vtile = PTILELOADDV %10:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %3:vtile = PTILELOADDV %13:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %21:vtile = PTDPBSSDV %13:gr16, %10:gr16, %14:gr16, %3:vtile(tied-def 0), %1:vtile, %2:vtile

2.       The configuration-placement pass looks at all of the AMX
pseudo-instructions and identifies regions in which the pseudo-instructions use
the same configuration parameters. It first replaces the register class of every
tile register whose shape is known at compile time. Since the shape of %1 is
constant, it replaces %1:vtile with %1:vtile3x4, which changes the register
class and morphs the pseudo instruction into a real AMX instruction. The shapes
of %2 and %3 are unknown at compile time, so the pass arbitrarily picks tile
register classes that have not been assigned yet and assigns them to %2 and %3.
After register class assignment the code looks like this, with vtile1x1 chosen
for %2 and vtile1x2 chosen for %3.
  PLDTILECFG
  %1:vtile3x4 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %2:vtile1x1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %3:vtile1x2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %21:vtile1x2 = TDPBSSDV %3:vtile1x2(tied-def 0), %1:vtile3x4, %2:vtile1x1
Some things I have not figured out:

a.       I am not sure whether the inputs and outputs of one AMX instruction can
accept multiple register classes (vtile1x1, ..., vtile16x16). If they cannot, we
need 256 pseudo instructions.

b.       Whether 256 register classes are enough to allocate from. There may be
more than 256 tile registers of unknown shape.

c.       In this pass we also find the proper point (a common dominator) to
insert ldtilecfg, but at this time the registers are not allocated yet, so we
don't know the shape of each physical tile register. So we just insert a pseudo
tile config instruction.

3.       All tile register classes share the same register units. The framework
does the register allocation, and the code is transformed into this.
  $tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1

4.       Run the config pass to collect the shape of each physical tile register
and configure them. The code could be generated as below. Here is the problem:
how can we know the shape of each physical tile register?
  MOV row, col info to %stack.0 for each physical tile register   ??????
  LDTILECFG %stack.0, 1, $noreg, 0, $noreg, implicit-def $tmm0, implicit-def $tmm1, implicit-def $tmm2, implicit-def $tmm3, implicit-def $tmm4, implicit-def $tmm5, implicit-def $tmm6, implicit-def $tmm7
  $tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1

Thanks
Yuanke
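
Step 2c above mentions placing ldtilecfg at a common dominator of the AMX instructions. The lookup can be sketched roughly as follows, assuming a hypothetical array-based dominator tree. The names idom, depth, and ldtilecfg_block are illustrative only and are not LLVM APIs; inside LLVM the same query would more likely go through MachineDominatorTree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dominator tree: idom[b] is the immediate dominator of block b,
 * depth[b] is b's depth in the tree, and the entry block is its own idom. */
static int nearest_common_dominator(const int *idom, const int *depth,
                                    int a, int b) {
    while (a != b) {
        /* Lift whichever block currently sits deeper in the tree. */
        if (depth[a] >= depth[b])
            a = idom[a];
        else
            b = idom[b];
    }
    return a;
}

/* Block in which a single ldtilecfg could be placed so that it dominates
 * every block in uses[0..n-1] that contains an AMX instruction. */
int ldtilecfg_block(const int *idom, const int *depth,
                    const int *uses, size_t n) {
    assert(n > 0);
    int dom = uses[0];
    for (size_t i = 1; i < n; ++i)
        dom = nearest_common_dominator(idom, depth, dom, uses[i]);
    return dom;
}
```

For the if/else diamond in the api() example (entry, if.then, if.else, if.end), the tile loads and the dot product live in three different blocks, so this sketch would pick the entry block.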

From: Hal Finkel <hfinkel at anl.gov>
Sent: Friday, August 21, 2020 3:35 AM
To: Topper, Craig <craig.topper at intel.com>; Kaylor, Andrew
<andrew.kaylor at intel.com>; Luo, Yuanke <yuanke.luo at intel.com>;
Philip Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 8/19/20 3:09 PM, Topper, Craig wrote:
The width and height can be runtime values that we would just copy into the
64-byte configuration block we pass to ldtilecfg. So the code doesn't need to be
multiversioned. The user code would also use those values to update pointers in
the loops they write using the tiles. If we can't determine that two tiles
were defined with the same width and height, we need to assume the shapes are
different and try to avoid ever assigning them the same tile register.
Hal, under your suggestion, would the assignment of physical registers to
register classes be defined dynamically before register allocation?
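
As an illustration of copying runtime shapes into the 64-byte configuration block, here is a minimal sketch. The field layout follows the palette 1 description in Intel's instruction set extensions reference; the type name tile_config and the helper functions are hypothetical, and the column value is assumed to be expressed in bytes per row.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Memory operand of ldtilecfg (64 bytes). Names are illustrative. */
typedef struct {
    uint8_t  palette_id;    /* byte 0: 1 selects the 8-tile palette */
    uint8_t  start_row;     /* byte 1 */
    uint8_t  reserved[14];  /* bytes 2-15, must be zero */
    uint16_t colsb[16];     /* bytes 16-47: bytes per row of each tile */
    uint8_t  rows[16];      /* bytes 48-63: number of rows of each tile */
} tile_config;

_Static_assert(sizeof(tile_config) == 64, "ldtilecfg reads 64 bytes");

/* Start from an all-zero block with palette 1 selected. */
void tile_config_init(tile_config *cfg) {
    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
}

/* Record the run-time shape of physical tile register tmm (0..7). */
void tile_config_set(tile_config *cfg, unsigned tmm,
                     uint8_t rows, uint16_t colsb) {
    assert(tmm < 8 && rows <= 16 && colsb <= 64);
    cfg->rows[tmm]  = rows;
    cfg->colsb[tmm] = colsb;
}
```

The rows and colsb arguments can be ordinary run-time values, which is why no multiversioning is needed; the filled block would then be handed to ldtilecfg.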



Here's my thought:

First, you have a set of intrinsics that take tile values along with tile
configuration parameters (which, presently, seem just to be the sizes). These
get lowered into pseudo-instructions that do the same. Thus, you have some
register class that represents these arbitrarily-sized tile registers that
you'll assign to these pseudo-instruction operands (i.e., they take virtual
tile registers right after instruction selection). You might use the 16x16 tile
register class for this purpose, but it shouldn't really matter.

Second, you run this configuration-placement pass. This pass looks at all of the
AMX pseudo-instructions and identifies regions in which the pseudo-instructions
use the same configuration parameters (i.e., the same SSA values and/or
constants). This pass might reorder the pseudo-instructions when legal in order
to form larger regions. Then it places the ldtilecfg at the start of each region
(in some common dominating position). ldtilecfg implicitly defines all of the
tile registers in every concrete class of tile registers (all 256 of them, or
whatever). The pseudo-instructions are replaced by real MI instructions taking a
tile register class appropriate for the configuration (which will default to the
16x16 class for cases where the configuration is not a compile-time-known
constant). When the configuration is a known constant, the instructions take
operands with a register class appropriate for that configuration (e.g., 1x1,
4x4).

Third, the rest of the framework runs as usual. Tile registers from the
appropriate class are allocated by the register allocator. No live range of any
virtual tile register can pass through the ldtilecfg (because it defines them
all), but that's okay: by construction, none of the live ranges will (the
configuration-placement pass ensures this).
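
The region-forming step of the configuration-placement pass can be sketched in a much simplified form. The sketch assumes straight-line code, no reordering, and a hypothetical integer ID per pseudo-instruction that stands for its set of configuration parameters; none of these names exist in LLVM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: cfg[i] identifies the configuration parameters
 * (SSA values or constants) that pseudo-instruction i depends on.
 * Writes the index of the first instruction of each region to starts[]
 * (which must have room for n entries) and returns the number of regions,
 * i.e. the number of ldtilecfg instructions to insert. */
size_t place_ldtilecfg(const int *cfg, size_t n, size_t *starts) {
    size_t regions = 0;
    for (size_t i = 0; i < n; ++i) {
        /* A new region begins wherever the configuration changes. */
        if (i == 0 || cfg[i] != cfg[i - 1])
            starts[regions++] = i;
    }
    return regions;
}
```

With the sequence of IDs 7, 7, 7, 9, 9, 7 this yields three regions. Reordering the last instruction up into the first region, when legal, would reduce that to two, which is the motivation for letting the pass reorder pseudo-instructions.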

Thanks again,

Hal



From: Hal Finkel <hfinkel at anl.gov>
Sent: Wednesday, August 19, 2020 12:52 PM
To: Kaylor, Andrew <andrew.kaylor at intel.com>; Luo, Yuanke
<yuanke.luo at intel.com>; Philip Reames <listmail at philipreames.com>;
llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
<craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 8/19/20 10:24 AM, Kaylor, Andrew wrote:
> When the tile shape is unknown at compile time, how do you plan to do the
> register allocation of the tiles? My question is: do you do the allocation for
> this case in the same way as you would if you knew the size was 16x16 (i.e.,
> conservatively assume the largest size)?

I think what will happen is that the registers are allocated based on a number
of runtime values that are assumed to be different from one another but less
than or equal to 16. So, for example, we'll allocate registers for MxN
tiles, NxM tiles and MxM tiles without knowing what M and N are. Then at runtime
the values of these variables will be used to create the actual tile
configuration. The instructions that need to know the shape take these runtime
values as operands.



So you're going to multiversion the code?

In any case, my point is that you probably don't need a custom register
allocator. If you just define the tile registers and make sure that the
ldtilecfgs implicitly defines them all, then the regular infrastructure likely
works. You'll have a bunch of register classes, but that's not
necessarily a problem. I recommend trying this, and let us know what you
discover, before we go down the road of a new, dedicated allocator just for
these registers.

 -Hal


There may be some artifacts coming from the front end that conservatively assume
a 16x16 tile, but I think those generally go away in SROA or later specialized
passes. Yuanke can confirm or correct my understanding of this.

From: Hal Finkel <hfinkel at anl.gov>
Sent: Wednesday, August 19, 2020 5:14 AM
To: Luo, Yuanke <yuanke.luo at intel.com>; Kaylor, Andrew
<andrew.kaylor at intel.com>; Philip Reames <listmail at philipreames.com>;
llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
<craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 8/19/20 5:34 AM, Luo, Yuanke wrote:
Having 256 register classes is not a problem in itself; it just seems like a lot
of register classes to me.
We don't assume the shape of each physical register is 16x16; it is defined
by the user. By variable shape, I mean the shape is known at run time but
unknown at compile time. Take the code below as an example: %row and %col are
variables rather than constants. The compiler recognizes llvm.x86.tileloadd64
and deduces that the shape of %0 is %row x %col.
%0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %col, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32)



When the tile shape is unknown at compile time, how do you plan to do the
register allocation of the tiles? My question is: do you do the allocation for
this case in the same way as you would if you knew the size was 16x16 (i.e.,
conservatively assume the largest size)?

Thanks again,

Hal



From: Hal Finkel <hfinkel at anl.gov>
Sent: Wednesday, August 19, 2020 4:58 PM
To: Luo, Yuanke <yuanke.luo at intel.com>; Kaylor, Andrew
<andrew.kaylor at intel.com>; Philip Reames <listmail at philipreames.com>;
llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
<craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 8/19/20 2:21 AM, Luo, Yuanke wrote:

Hi Hal,
There are 3 aspects to be solved.

1.       The HW supports a maximum shape of 16x16, so there are many register
classes, from 1x1 to 16x16. We need 256 register classes.

2.       We want to support variable shapes, so the compiler doesn't know which
register class fits a tile shape that is only known at run time.

3.       The tile configuration configures the physical tile registers, so we
need to allocate registers first; only then do we know the shape of each
physical tile register and can configure it.
I think your suggestion helps reduce the complexity if we only support
fixed (constant) tile shapes.
-Yuanke



Thanks, Yuanke.

It's not clear to me that having 256 register classes is, in itself, a
problem. Is it?

What does it mean to support variable-shape tiles in this context? Do you do
something other than conservatively assume that they are 16x16 for
register-allocation purposes?

 -Hal



From: Hal Finkel <hfinkel at anl.gov>
Sent: Wednesday, August 19, 2020 8:20 AM
To: Kaylor, Andrew <andrew.kaylor at intel.com>; Philip Reames
<listmail at philipreames.com>; Luo, Yuanke <yuanke.luo at intel.com>;
llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
<craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.


Hi, Andy,

I don't quite understand everything that's going on here. Could we model
this as:

 1. Define a collection of register classes, one for 2x4 tiles, one for 4x2
tiles, etc. each populated with a set of tile registers. Registers can have
aliasing relationships (instead of worrying about any kind of
subregister/superregister relationships -- these won't be useful anyway).

 2. Define the tile-configuration instructions so that they implicitly define
all of the registers in all of the classes.

Then you would still need to pre-schedule the tile operations as you've
described, and collect the configuration information in order to add the
ldtilecfgs, but the regular register allocator can handle the allocation itself
in the usual way. What do you think?

 -Hal
On 8/18/20 6:58 PM, Kaylor, Andrew via llvm-dev wrote:
The AMX registers are complicated. The single configuration register (which is
mostly used implicitly, similar to MXCSR for floating point) controls the shape
of all the tile registers, and if you change the tile configuration every single
tile register is cleared. In practice, if we have to change the
configuration while any of the tile registers are live, performance is going to
be terrible. We need to handle this case for correctness, but users of this
programming interface will need to have enough awareness of the performance
issues and the hardware details to prevent this. We'll also want a
diagnostic that lets the user know when this has happened.

When the tile configuration is set, the shape of each tile is locked in, so the
individual tile registers aren't interchangeable at that point. If a
function needs 2x4 tiles, 4x2 tiles, and 4x4 tiles, the configuration needs to
be set with this in mind. The shape isn't explicit in every instruction and
intrinsic. It must be deduced. And again, we'll need a way to tell the user
when efficient allocation can't be done. In practice, I don't expect any
function to be using more than three tile shapes.

The implication of all this is that I don't think the greedy register
allocator is well suited to figure all of this out. We need a special pass to
pre-allocate these registers. If the function is written in a way that makes
good performance possible, it should be a relatively simple task to allocate
everything with minimal spilling. If it isn't possible to get good
performance, we don't need to do anything especially clever. We can just do
something straightforward that is correct and let the user know that they
aren't going to be happy with the results.

-Andy

From: Philip Reames <listmail at philipreames.com>
Sent: Friday, August 14, 2020 8:29 PM
To: Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
<hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.


I find your answer unconvincing.  I'm not going to debate it as I don't
wish to take the time to build the appropriate context, but my initial response
is skepticism.

Philip
On 8/14/20 4:49 PM, Luo, Yuanke wrote:
[Yuanke] AMX registers are special. They need to be configured before use, and
the config instruction is expensive. To avoid unnecessary tile configuration, we
collect as much tile shape information as possible and combine it into one
ldtilecfg instruction. The ldtilecfg instruction should dominate every AMX
instruction that accesses a tile register. On the other hand, the ldtilecfg
should post-dominate the instructions that define the tile shapes. A tile
register spill should avoid a re-config caused by a different tile shape, so the
spilled register should be reloaded into a register that has the same tile
shape. Since tile register allocation is special and may need to allocate
general virtual registers to configure the tile registers, we can add a separate
pass that runs before the general register allocation pass. After register
allocation, the tile shape information is no longer needed, so we can transform
the pseudo AMX instructions into real AMX instructions by removing the row and
column operands.

[Philip]

This seems complicated.

Reading through the documentation, there appears to be a single global tile
config for all tile registers at any time.

Why not simply model this tile config as a designated special register and the
tile instructions as having an implicit use of this register?  That would seem
to ensure that the register allocator has all the constraints needed.  You'd
need to teach it how to spill the special registers with the appropriate
instructions, but that seems a lot more straightforward?
[Yuanke] In that case users need to configure the tile registers themselves.
Spilling the configuration register is very expensive, because it clears all the
tile data registers to zero. In our proposal, the compiler is responsible for
deducing the shape of each virtual tile data register, allocating physical
registers for them, and then configuring those physical registers. We may build
the dependency as you proposed, and it can be used in a machine IR check to
ensure each tile data register is configured before use.

From: Philip Reames <listmail at philipreames.com>
Sent: Saturday, August 15, 2020 1:17 AM
To: Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
<hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 8/14/20 6:27 AM, Luo, Yuanke via llvm-dev wrote:
Hi,
Intel Advanced Matrix Extensions (Intel AMX) is a new programming paradigm
consisting of two components: a set of 2-dimensional registers (tiles)
representing sub-arrays from a larger 2-dimensional memory image, and
accelerators able to operate on tiles. Capability of Intel AMX implementation is
enumerated by palettes. Two palettes are supported: palette 0 represents the
initialized state and palette 1 consists of 8 tile registers of up to 1 KB size,
which is controlled by a tile control register.
The instruction manual is posted at
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html.
The AMX ABI proposal is posted at
https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4.
This email is to discuss the programming model for AMX. Florian has introduced
the matrix type and intrinsics in the LLVM community. We'd like to adopt some
ideas from it.
Here is what we propose for the AMX programming model.

1.        Data type.
We'd like to have a fixed vector type for AMX. Since the shape of an AMX
register is configurable, the vector size is the maximum size of an AMX
register. That means the vector size is 1024 bytes.
The C code may look like this.
typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));
_tile_data tile;
And the LLVM IR may look like this.
@tile = dso_local local_unnamed_addr global <256 x i32> zeroinitializer, align 64
For LLVM IR, it would be nice to have a new type x86_amxtile that can be mapped
to AMX registers.

2.       AMX Intrinsics.
The internal intrinsics are mapped 1:1 to AMX instructions. The parameters m, n,
and k identify the shape of the tile. The shape can be variable, but it cannot
exceed the size that the AMX HW supports. The compiler can deduce the shape of
the tile from the AMX intrinsics.
_tile_data _tile_loadd_internal(char m, short n, const void *base, int stride);
_tile_data _tile_dpbssd_internal(char m, short n, short k, _tile_data dst, _tile_data src1, _tile_data src2);
_tile_data _tile_dpbf16ps_internal(char m, short n, short k, _tile_data dst, _tile_data src1, _tile_data src2);
void _tile_stored_internal(char m, short n, void *base, int stride, _tile_data tile);

3.       User interfaces.
The tile shape and tile data are combined into a struct in the C language. The
shape of the tile may only be initialized once. The user interface looks like
this.
#define __DEFAULT_FN_AMX    \
__attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))

typedef struct __tile_str {
  const char row;
  const short col;
  _tile_data tile;
}__tile;

__DEFAULT_FN_AMX
void __tile_loadd(__tile *dst, const void *base, long stride) {
  dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
}

__DEFAULT_FN_AMX
void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
  dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col, dst->tile, src1.tile, src2.tile);
}

__DEFAULT_FN_AMX
void __tile_stored(void *base, long stride, __tile src) {
  _tile_stored_internal(src.row, src.col, base, stride, src.tile);
}


4.       Example code
The example shows how to use the user interface in a function.
void api(int cond, short row, short col) {
  __tile a = {row, col};
  __tile b = {row, col};
  __tile c = {row, col};

  if(cond) {
    __tile_loadd(&a, buf, STRIDE);
    __tile_loadd(&b, buf, STRIDE);
    __tile_loadd(&c, buf, STRIDE);
  } else {
    __tile_loadd(&a, buf2, STRIDE);
    __tile_loadd(&b, buf2, STRIDE);
    __tile_loadd(&c, buf2, STRIDE);
  }
  __tile_dpbsud(&c, a, b);
  __tile_stored(buf, STRIDE, c);
}

5.       LLVM IR
The LLVM IR intrinsics take the row and column as input parameters, so that the
compiler can deduce the shape of the tile data. The remaining parameters are
what the AMX instructions require. This is the LLVM IR corresponding to the
example code.
define dso_local void @api(i32 %cond, i16 signext %row, i16 signext %col) local_unnamed_addr #2 {
entry:
  %tobool = icmp eq i32 %cond, 0
  %sext = shl i16 %col, 8
  %conv.i31 = ashr exact i16 %sext, 8
  br i1 %tobool, label %if.else, label %if.then

if.then:                                          ; preds = %entry
  %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  br label %if.end

if.else:                                          ; preds = %entry
  %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
  %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
  %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
  %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31, i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3
  tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32, <256 x i32> %6) #3
  ret void
}

6.       Shape propagation
In an -O0 build, the front end generates some general loads and stores of the
tile vector. We need to start from the AMX intrinsics and propagate the shape
information to the virtual tile registers. If an AMX intrinsic uses the result
of a load instruction, the shape is propagated to the load and the load is
transformed into a tile load intrinsic. If a store instruction uses the result
of an AMX intrinsic, the shape is propagated to the store and the store is
transformed into a tile store intrinsic.

7.       Machine IR
Since the AMX intrinsics take the row and column as input parameters, we can
create a pseudo instruction corresponding to each of them. The AMX intrinsics
are lowered to pseudo AMX instructions, which carry extra row and column
operands corresponding to those of the AMX intrinsic. The real AMX instructions
don't need the row and column operands. The row and column information should be
configured by ldtilecfg before any AMX instruction is executed.

8.       Register allocation
AMX registers are special. They need to be configured before use, and the config
instruction is expensive. To avoid unnecessary tile configuration, we collect as
much tile shape information as possible and combine it into one ldtilecfg
instruction. The ldtilecfg instruction should dominate every AMX instruction
that accesses a tile register. On the other hand, the ldtilecfg should
post-dominate the instructions that define the tile shapes. A tile register
spill should avoid a re-config caused by a different tile shape, so the spilled
register should be reloaded into a register that has the same tile shape.
Since tile register allocation is special and may need to allocate general
virtual registers to configure the tile registers, we can add a separate pass
that runs before the general register allocation pass. After register
allocation, the tile shape information is no longer needed, so we can transform
the pseudo AMX instructions into real AMX instructions by removing the row and
column operands.

This seems complicated.

Reading through the documentation, there appears to be a single global tile
config for all tile registers at any time.

Why not simply model this tile config as a designated special register and the
tile instructions as having an implicit use of this register?  That would seem
to ensure that the register allocator has all the constraints needed.  You'd
need to teach it how to spill the special registers with the appropriate
instructions, but that seems a lot more straightforward?

9.       Use recommendation
Because of the shape configuration issue, we recommend that users define the
tile shape at the entry of the function and inline functions as much as
possible. The AMX instructions focus on computation rather than storage, so
global variables for tile data are not recommended.

Thanks
Yuanke









_______________________________________________

LLVM Developers mailing list

llvm-dev at lists.llvm.org

https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev








--

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory


Luo, Yuanke via llvm-dev

2020-Aug-22 02:54 UTC


[llvm-dev] Intel AMX programming model discussion.

It seems I made a mistake about sharing the register unit. Can tile registers
in different tile register classes (each class has a different tile shape)
share a register unit? Consider two virtual tile registers, %2:vtile1x1 and
%3:vtile1x2. First %2 is allocated to $tmm0; after %2 is killed, %3 is
allocated to $tmm0. This is not allowed, because when $tmm0 is allocated to %2
its shape is configured as 1x1. If we reallocate $tmm0 to %3, we need to
re-configure $tmm0 as 1x2, which clobbers $tmm0 through $tmm7.

Yuanke
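
The constraint described in this message can be phrased as a check over the allocator's result. The following is a minimal sketch under stated assumptions: the tile_assignment struct and the integer shape IDs are hypothetical, and the check covers a single ldtilecfg region only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One virtual tile register after allocation: the physical register it
 * received (0..7 for $tmm0..$tmm7) and an ID for its shape class. */
typedef struct {
    int phys;
    int shape;
} tile_assignment;

/* Returns true if, within one ldtilecfg region, no physical tile register
 * is shared by virtual registers of different shapes. */
bool shapes_consistent(const tile_assignment *a, size_t n) {
    int shape_of[8];
    for (int i = 0; i < 8; ++i)
        shape_of[i] = -1;                 /* -1: register not used yet */
    for (size_t i = 0; i < n; ++i) {
        assert(a[i].phys >= 0 && a[i].phys < 8 && a[i].shape >= 0);
        if (shape_of[a[i].phys] == -1)
            shape_of[a[i].phys] = a[i].shape;
        else if (shape_of[a[i].phys] != a[i].shape)
            return false;                 /* reuse would force a re-config */
    }
    return true;
}
```

In the example from the message, %2 (1x1) and %3 (1x2) both landing on $tmm0 would make this check fail, even though their live ranges do not overlap.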

From: Luo, Yuanke
Sent: Friday, August 21, 2020 2:12 PM
To: Hal Finkel <hfinkel at anl.gov>; Topper, Craig <craig.topper at
intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>; Philip Reames
<listmail at philipreames.com>; llvm-dev at lists.llvm.org; florian_hahn
at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: RE: [llvm-dev] Intel AMX programming model discussion.

Hi Hal,
The proposal is attractive to me, but there is something I still can't
figure out. Let's take below MIR as an example. We assume we have 256
register classes (vtile1x1, vtile1x2, ..., tile16x16).

1.       After instruction selection, the pseudo AMX instruction is generated.
The name of pseudo instructions have 'P' prefix. Now all the AMX pseudo
instruction take vtile as register class. Let's assume %13 is constant 3,
%10 is constant 4 and %14 is variable.
  %1:vtile = PTILELOADDV %13:gr16, %10:gr16, %17:gr64, 1, %18:gr64_nosp, 0,
$noreg
  %2:vtile = PTILELOADDV %10:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0,
$noreg
  %3:vtile = PTILELOADDV %13:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0,
$noreg
%21:vtile = PTDPBSSDV %13:gr16, %10:gr16, %14:gr16, %3:vtile(tied-def 0),
%1:vtile, %2:vtile

2.       The configuration-placement pass looks at all of the AMX
pseudo-instructions and identifies regions in which the pseudo-instructions use
the same configuration parameters. It first replaces the register class for all
tile registers whose shape is known in compile-time. Since the shape of %1 is
constant, so it replaces %1:vtile with %1:vtile3x4 which change the register
class and morph pseudo instruction into AMX real instruction. The shape of %2
and %3 is unknown in compile-time, so it arbitrarily picks up a tile register
class which is not assigned before and assign the register class to %2 and %3.
After register class allocation, the code is transformed as this. The register
class for %2:vtile1x1 and %3:vtile1x2 is allocated.
   PLDTILECFG
  %1:vtile3x4  = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %2:vtile1x1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %3:vtile1x2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
%21:vtile1x2 = TDPBSSDV %9:vtile1x2(tied-def 0), %1:vtile3x4, %2:vtile1x1
Something I am not figured out.

a.       I am not sure whether an AMX instruction's inputs and outputs can
accept multiple register classes (vtile1x1, ..., vtile16x16). If they cannot, we
need 256 pseudo instructions.

b.       Whether 256 register classes are enough. There may be more than 256
tile registers of unknown shape.

c.       In this pass we also find the proper point (a common dominator) to
insert ldtilecfg, but at this time the registers are not allocated yet, so we
don't know the shape of each physical tile register. So we just insert a pseudo
tile config instruction.

3.       All tile register classes share the same register units. Register
allocation is done by the framework, and the code is transformed into this.
  $tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1

4.       Run the config pass to collect the shape of each physical tile register
and configure them. The code could be generated as below. Here is the problem:
how can we know the shape of each physical tile register?
  MOV row, col info to %stack.0 for each physical tile register   ??????
  LDTILECFG %stack.0, 1, $noreg, 0, $noreg, implicit-def $tmm0,
      implicit-def $tmm1, implicit-def $tmm2, implicit-def $tmm3,
      implicit-def $tmm4, implicit-def $tmm5, implicit-def $tmm6,
      implicit-def $tmm7
  $tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1
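
To make the "MOV row, col info to %stack.0" step concrete, here is a minimal C sketch of filling the 64-byte memory operand that LDTILECFG reads, once a shape is known for each physical tile register. The field layout follows the palette 1 description in the instruction manual; the names tile_config, tile_shape and build_tile_config are hypothetical and only for illustration, and the column count is assumed to be given in bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layout of the 64-byte memory operand of LDTILECFG (palette 1). */
typedef struct {
    uint8_t  palette_id;   /* byte 0 */
    uint8_t  start_row;    /* byte 1 */
    uint8_t  reserved[14]; /* bytes 2..15, must be zero */
    uint16_t colsb[16];    /* bytes 16..47: bytes per row of each tile */
    uint8_t  rows[16];     /* bytes 48..63: number of rows of each tile */
} tile_config;

/* Hypothetical record: the shape chosen for one physical tile register. */
typedef struct {
    uint8_t  rows;   /* at most 16 */
    uint16_t colsb;  /* at most 64 bytes */
} tile_shape;

/* Fill cfg for $tmm0..$tmm(n-1). Returns 0 on success, -1 on a bad shape. */
int build_tile_config(tile_config *cfg, const tile_shape *shapes, int n) {
    assert(cfg != NULL);
    if (n < 0 || n > 8)
        return -1;
    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
    for (int i = 0; i < n; i++) {
        if (shapes[i].rows > 16 || shapes[i].colsb > 64)
            return -1;
        cfg->rows[i] = shapes[i].rows;
        cfg->colsb[i] = shapes[i].colsb;
    }
    return 0;
}
```

In the compiler the row and column values would be virtual registers stored into the stack slot rather than C variables, so the open question in step 4 (which shape belongs to which physical register) still has to be answered before this block can be filled.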

Thanks
Yuanke

From: Hal Finkel <hfinkel at anl.gov>
Sent: Friday, August 21, 2020 3:35 AM
To: Topper, Craig <craig.topper at intel.com>; Kaylor, Andrew
<andrew.kaylor at intel.com>; Luo, Yuanke <yuanke.luo at intel.com>; Philip
Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 8/19/20 3:09 PM, Topper, Craig wrote:
The width and height can be runtime values that we would just copy into the
64-byte configuration block we pass to ldtilecfg. So the code doesn't need to be
multiversioned. The user code would also use those values to update pointers in
the loops they write using the tiles. If we can't determine that two tiles
were defined with the same width and height, we need to assume the shapes are
different and avoid ever assigning them the same tile register.
Hal, in your suggestion, would the assignment of physical registers to register
classes be defined dynamically before register allocation?



Here's my thought:

First, you have a set of intrinsics that take tile values along with tile
configuration parameters (which, presently, seem just to be the sizes). These
get lowered into pseudo-instructions that do the same. Thus, you have some
register class that represents these arbitrarily-sized tile registers that
you'll assign to these pseudo-instruction operands (i.e., they take virtual
tile registers right after instruction selection). You might use the 16x16 tile
register class for this purpose, but it shouldn't really matter.

Second, you run this configuration-placement pass. This pass looks at all of the
AMX pseudo-instructions and identifies regions in which the pseudo-instructions
use the same configuration parameters (i.e., the same SSA values and/or
constants). This pass might reorder the pseudo-instructions when legal in order
to form larger regions. Then it places the ldtilecfg at the start of each region
(in some common dominating position). ldtilecfg implicitly defines all of the
tile registers in every concrete class of tile registers (all 256 of them, or
whatever). The pseudo-instructions are replaced by real MI instructions taking a
tile register class appropriate for the configuration (which will default to the
16x16 class for cases where the configuration is not a compile-time-known
constant). When the configuration is a known constant, the instructions take
operands with a register class appropriate for that configuration (e.g., 1x1,
4x4).

Third, the rest of the framework runs as usual. Tile registers from the
appropriate class are allocated by the register allocator. No live range of any
virtual tile register can pass through the ldtilecfg (because it defines them
all), but that's okay: by construction, none of the live ranges will (the
configuration-placement pass ensures this).

Thanks again,

Hal



From: Hal Finkel <hfinkel at anl.gov>
Sent: Wednesday, August 19, 2020 12:52 PM
To: Kaylor, Andrew <andrew.kaylor at intel.com>; Luo, Yuanke
<yuanke.luo at intel.com>; Philip Reames <listmail at philipreames.com>;
llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
<craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 8/19/20 10:24 AM, Kaylor, Andrew wrote:
> When the tile shape is unknown at compile time, how do you plan to do the
> register allocation of the tiles? My question is: do you do the allocation for
> this case in the same way as you would if you knew the size was 16x16 (i.e.,
> conservatively assume the largest size)?

I think what will happen is that the registers are allocated based on a number
of runtime values that are assumed to be different from one another but less
than or equal to 16. So, for example, we'll allocate registers for MxN
tiles, NxM tiles and MxM tiles without knowing what M and N are. Then at runtime
the values of these variables will be used to create the actual tile
configuration. The instructions that need to know the shape take these runtime
values as operands.
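
One way to picture allocating for MxN, NxM and MxM tiles without knowing M and N is to identify a shape by which values define its row and column counts, not by the numbers themselves. The C sketch below is only an illustration under that assumption; sym_shape, same_shape and count_distinct_shapes are hypothetical names, and the integer ids stand in for SSA values or constants.

```c
#include <assert.h>

/* A shape is identified by the ids of the values that define its row and
 * column counts, not by the (unknown) runtime numbers themselves. */
typedef struct {
    int row_id;
    int col_id;
} sym_shape;

/* Two tiles are known to share a shape only if both defining values match. */
static int same_shape(sym_shape a, sym_shape b) {
    return a.row_id == b.row_id && a.col_id == b.col_id;
}

/* Count how many shapes must conservatively be treated as different. */
int count_distinct_shapes(const sym_shape *s, int n) {
    assert(n >= 0);
    int distinct = 0;
    for (int i = 0; i < n; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++) {
            if (same_shape(s[i], s[j])) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

With M and N as two different values, the tiles MxN, NxM, MxM and a second MxN count as three distinct shapes, even though M and N might happen to be equal at run time.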



So you're going to multiversion the code?

In any case, my point is that you probably don't need a custom register
allocator. If you just define the tile registers and make sure that the
ldtilecfg instructions implicitly define them all, then the regular
infrastructure likely works. You'll have a bunch of register classes, but
that's not necessarily a problem. I recommend trying this, and letting us know
what you discover, before we go down the road of a new, dedicated allocator just
for these registers.

 -Hal


There may be some artifacts coming from the front end that conservatively assume
a 16x16 tile, but I think those generally go away in SROA or later specialized
passes. Yuanke can confirm or correct my understanding of this.

From: Hal Finkel <hfinkel at anl.gov>
Sent: Wednesday, August 19, 2020 5:14 AM
To: Luo, Yuanke <yuanke.luo at intel.com>; Kaylor, Andrew
<andrew.kaylor at intel.com>; Philip Reames <listmail at philipreames.com>;
llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
<craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 8/19/20 5:34 AM, Luo, Yuanke wrote:
Having 256 register classes is not a problem in itself; it just seems like a lot
of register classes to me.
We don't assume the shape of each physical register is 16x16; the shape is
defined by the user. By variable shape, I mean that the shape is known only at
run time and is unknown at compile time. Take the code below as an example: %row
and %col are variables instead of constants. The compiler recognizes
llvm.x86.tileloadd64 and deduces that the shape of %0 is %row x %col.
%0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %col, i8*
getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32)



When the tile shape is unknown at compile time, how do you plan to do the
register allocation of the tiles? My question is: do you do the allocation for
this case in the same way as you would if you knew the size was 16x16 (i.e.,
conservatively assume the largest size)?

Thanks again,

Hal



From: Hal Finkel <hfinkel at anl.gov>
Sent: Wednesday, August 19, 2020 4:58 PM
To: Luo, Yuanke <yuanke.luo at intel.com>; Kaylor, Andrew
<andrew.kaylor at intel.com>; Philip Reames <listmail at philipreames.com>;
llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
<craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 8/19/20 2:21 AM, Luo, Yuanke wrote:

Hi Hal,
There are 3 aspects to be solved.

1.       The HW supports a maximum shape of 16x16, so there are many register
classes, from 1x1 to 16x16. We need 256 register classes.

2.       We want to support variable shapes, so the compiler doesn't know which
register class fits a tile shape that is only known at run time.

3.       The tile configuration configures physical tile registers, so we need
to allocate registers first; only then do we know the shape of each physical
tile register and can configure it.
I think your suggestion is helpful for reducing the complexity if we only
support fixed (constant) tile shapes.
-Yuanke



Thanks, Yuanke.

It's not clear to me that having 256 register classes is, in itself, a
problem. Is it?

What does it mean to support variable-shape tiles in this context? Do you do
something other than conservatively assume that they are 16x16 for
register-allocation purposes?

 -Hal



From: Hal Finkel <hfinkel at anl.gov>
Sent: Wednesday, August 19, 2020 8:20 AM
To: Kaylor, Andrew <andrew.kaylor at intel.com>; Philip Reames
<listmail at philipreames.com>; Luo, Yuanke <yuanke.luo at intel.com>;
llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
<craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.


Hi, Andy,

I don't quite understand everything that's going on here. Could we model
this as:

 1. Define a collection of register classes, one for 2x4 tiles, one for 4x2
tiles, etc., each populated with a set of tile registers. Registers can have
aliasing relationships (instead of worrying about any kind of
subregister/superregister relationships -- these won't be useful anyway).

 2. Define the tile-configuration instructions so that they implicitly define
all of the registers in all of the classes.

Then you would still need to pre-schedule the tile operations as you've
described, and collect the configuration information in order to add the
ldtilecfgs, but the regular register allocator can handle the allocation itself
in the usual way. What do you think?

 -Hal
On 8/18/20 6:58 PM, Kaylor, Andrew via llvm-dev wrote:
The AMX registers are complicated. The single configuration register (which is
mostly used implicitly, similar to MXCSR for floating point) controls the shape
of all the tile registers, and if you change the tile configuration every single
tile register is cleared. In practice, if we have to change the
configuration while any of the tile registers are live, performance is going to
be terrible. We need to handle this case for correctness, but users of this
programming interface will need to have enough awareness of the performance
issues and the hardware details to prevent this. We'll also want a
diagnostic that lets the user know when this has happened.

When the tile configuration is set, the shape of each tile is locked in, so the
individual tile registers aren't interchangeable at that point. If a
function needs 2x4 tiles, 4x2 tiles, and 4x4 tiles, the configuration needs to
be set with this in mind. The shape isn't explicit in every instruction and
intrinsic. It must be deduced. And again, we'll need a way to tell the user
when efficient allocation can't be done. In practice, I don't expect any
function to be using more than three tile shapes.

The implication of all this is that I don't think the greedy register
allocator is well suited to figure all of this out. We need a special pass to
pre-allocate these registers. If the function is written in a way that makes
good performance possible, it should be a relatively simple task to allocate
everything with minimal spilling. If it isn't possible to get good
performance, we don't need to do anything especially clever. We can just do
something straightforward that is correct and let the user know that they
aren't going to be happy with the results.

-Andy

From: Philip Reames <listmail at philipreames.com>
Sent: Friday, August 14, 2020 8:29 PM
To: Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
<hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.


I find your answer unconvincing.  I'm not going to debate it as I don't
wish to take the time to build the appropriate context, but my initial response
is skepticism.

Philip
On 8/14/20 4:49 PM, Luo, Yuanke wrote:
[Yuanke] An AMX register is special. It needs to be configured before use, and
the config instruction is expensive. To avoid unnecessary tile configuration, we
collect as much tile shape information as possible and combine it into one
ldtilecfg instruction. The ldtilecfg instruction should dominate every AMX
instruction that accesses a tile register. On the other hand, the ldtilecfg
should post-dominate the instructions that define the tile shape. Tile register
spilling should avoid re-configuration due to a different tile shape, so a
spilled register should be reloaded into a register that has the same tile
shape. Since tile register allocation is special and may allocate general
virtual registers to configure the tile registers, we can add a separate pass to
do it before the general register allocation pass. After register allocation the
tile shape information is not needed anymore, so we can transform the pseudo AMX
instructions into real AMX instructions by removing the row and column operands.

[Philip]

This seems complicated.

Reading through the documentation, there appears to be a single global tile
config for all tile registers at any time.

Why not simply model this tile config as a designated special register and the
tile instructions as having an implicit use of this register?  That would seem
to ensure that the register allocator has all the constraints needed.  You'd
need to teach it how to spill the special registers with the appropriate
instructions, but that seems a lot more straight forward?
[Yuanke] In that case users need to configure the tile registers themselves.
Spilling the configuration register is very expensive, because it clears all the
tile data registers to zero. In our proposal, the compiler is responsible for
deducing the shape of each virtual tile data register, allocating physical
registers for them, and then configuring those physical registers. We may build
the dependency as you proposed, and it can be used by a machine IR check to
ensure that each tile data register is configured before use.

From: Philip Reames <listmail at philipreames.com>
Sent: Saturday, August 15, 2020 1:17 AM
To: Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
<hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 8/14/20 6:27 AM, Luo, Yuanke via llvm-dev wrote:
Hi,
Intel Advanced Matrix Extensions (Intel AMX) is a new programming paradigm
consisting of two components: a set of 2-dimensional registers (tiles)
representing sub-arrays from a larger 2-dimensional memory image, and
accelerators able to operate on tiles. The capabilities of an Intel AMX
implementation are enumerated by palettes. Two palettes are supported: palette 0
represents the initialized state and palette 1 consists of 8 tile registers of
up to 1 KB each, which are controlled by a tile control register.
The instruction manual is posted at
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html.
The AMX ABI proposal is posted at
https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4.
This email is to discuss the programming model for AMX. Florian has introduced
the matrix type and intrinsics to the LLVM community. We'd like to adopt some
ideas from it.
Here is what we propose for the AMX programming model.

1.        Data type.
We'd like to have a fixed vector type for AMX. Since the shape of an AMX
register is configurable, the vector size is the maximum size of an AMX
register. That means the vector size is 1024 bytes.
The C code may look like this.
typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));
_tile_data tile;
And the LLVM IR may look like this.
@tile = dso_local local_unnamed_addr global <256 x i32> zeroinitializer, align 64
For LLVM IR, it would be nice to have a new type x86_amxtile that can be mapped
to AMX registers.
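
As a quick sanity check on the numbers above, the following C sketch relates the 1024-byte vector type to the maximum 16 x 64-byte tile and to the <256 x i32> IR type. It relies on the GCC/Clang vector extension used in the typedef and assumes a 4-byte int; the names TILE_MAX_ROWS, TILE_MAX_COLSB and tile_elems are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The fixed vector type from the proposal (GCC/Clang vector extension). */
typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));

/* Maximum tile shape in palette 1: 16 rows of 64 bytes each. */
enum { TILE_MAX_ROWS = 16, TILE_MAX_COLSB = 64 };

/* The vector must be exactly as large as the largest possible tile. */
_Static_assert(sizeof(_tile_data) == TILE_MAX_ROWS * TILE_MAX_COLSB,
               "tile vector must cover a full 16 x 64-byte tile");

/* Number of int elements in the vector; 256 when int is 4 bytes wide. */
size_t tile_elems(void) {
    return sizeof(_tile_data) / sizeof(int);
}
```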

2.       AMX Intrinsics.
The internal intrinsics map 1:1 to AMX instructions. The parameters m, n and k
identify the shape of the tile. The shape can be variable, but it cannot exceed
the size that the AMX HW supports. The compiler can deduce the shape of the tile
from the AMX intrinsics.
_tile_data _tile_loadd_internal(char m, short n, const void *base, int stride);
_tile_data _tile_dpbssd_internal(char m, short n, short k, _tile_data dst,
_tile_data src1, _tile_data src2);
_tile_data _tile_dpbf16ps_internal(char m, short n, short k, _tile_data dst,
_tile_data src1, _tile_data src2);
void _tile_stored_internal(char m, short n, void *base, int stride, _tile_data
tile);
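
To show how m, n and k relate the three tiles of a dot-product intrinsic, here is a scalar C reference for the signed-byte dot product that TDPBSSD performs: dst is m rows by n 32-bit accumulators, src1 is m rows by k groups of four signed bytes, and src2 is k rows by n groups of four signed bytes. This is only a sketch; ref_tdpbssd is a hypothetical name, and n and k are counted in 32-bit elements here for readability, which may differ from the units of the intrinsic operands above.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference for a signed-byte dot-product accumulate.
 *   dst : m rows x n int32 accumulators
 *   src1: m rows x k groups of 4 int8 values
 *   src2: k rows x n groups of 4 int8 values
 * All three tiles are stored densely, row after row. */
void ref_tdpbssd(int m, int n, int k, int32_t *dst,
                 const int8_t *src1, const int8_t *src2) {
    assert(m >= 0 && m <= 16 && n >= 0 && n <= 16 && k >= 0 && k <= 16);
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            int32_t acc = dst[i * n + j];
            for (int p = 0; p < k; p++) {
                /* Multiply the four byte pairs of one dword and accumulate. */
                for (int b = 0; b < 4; b++) {
                    acc += (int32_t)src1[(i * k + p) * 4 + b] *
                           (int32_t)src2[(p * n + j) * 4 + b];
                }
            }
            dst[i * n + j] = acc;
        }
    }
}
```

The loop bounds make the shape deduction visible: the destination shape comes from m and n, the first source from m and k, and the second source from k and n.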

3.       User interfaces.
The tile shape and tile data are combined into a struct in C. The shape of a
tile may only be initialized once. The user interface looks like this.
#define __DEFAULT_FN_AMX    \
  __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))

typedef struct __tile_str {
  const char row;
  const short col;
  _tile_data tile;
} __tile;

__DEFAULT_FN_AMX
void __tile_loadd(__tile *dst, const void *base, long stride) {
  dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
}

__DEFAULT_FN_AMX
void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
  dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
                                    dst->tile, src1.tile, src2.tile);
}

__DEFAULT_FN_AMX
void __tile_stored(void *base, long stride, __tile src) {
  _tile_stored_internal(src.row, src.col, base, stride, src.tile);
}


4.       Example code
The example shows how to use the user interface in a function.
void api(int cond, short row, short col) {
  __tile a = {row, col};
  __tile b = {row, col};
  __tile c = {row, col};

  if(cond) {
    __tile_loadd(&a, buf, STRIDE);
    __tile_loadd(&b, buf, STRIDE);
    __tile_loadd(&c, buf, STRIDE);
  } else {
    __tile_loadd(&a, buf2, STRIDE);
    __tile_loadd(&b, buf2, STRIDE);
    __tile_loadd(&c, buf2, STRIDE);
  }
  __tile_dpbsud(&c, a, b);
  __tile_stored(buf, STRIDE, c);
}

5.       LLVM IR
The LLVM IR intrinsics take the row and column information as input parameters,
so that the compiler can deduce the shape of the tile data. The remaining
parameters are what the AMX instructions require. This is the LLVM IR
corresponding to the example code.
define dso_local void @api(i32 %cond, i16 signext %row, i16 signext %col)
    local_unnamed_addr #2 {
entry:
  %tobool = icmp eq i32 %cond, 0
  %sext = shl i16 %col, 8
  %conv.i31 = ashr exact i16 %sext, 8
  br i1 %tobool, label %if.else, label %if.then

if.then:                                          ; preds = %entry
  %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
      i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
      i64 32) #3
  %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
      i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
      i64 32) #3
  %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
      i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
      i64 32) #3
  br label %if.end

if.else:                                          ; preds = %entry
  %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
      i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
      i64 32) #3
  %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
      i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
      i64 32) #3
  %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
      i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
      i64 32) #3
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
  %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
  %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
  %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31,
      i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0,
      <256 x i32> %b.sroa.1068.0) #3
  tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31,
      i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
      i64 32, <256 x i32> %6) #3
  ret void
}

6.       Shape propagation
In an -O0 build, some general loads/stores of the tile vector are generated by
the front end. We need to start from the AMX intrinsics and propagate the shape
information to the virtual tile registers. If an AMX intrinsic uses the result
of a load instruction, the shape is propagated to the load and the load is
transformed into a tile load intrinsic. If a store instruction uses the result
of an AMX intrinsic, the shape is propagated to the store instruction and the
store is transformed into a tile store intrinsic.
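
The propagation described above can be pictured with a toy model. The C sketch below is not LLVM code: node, node_kind and propagate_shapes are hypothetical names, each node has at most one operand, and a single pass is enough because a shape moves only one step, from an AMX operation to the plain load feeding it or to the plain store consuming it.

```c
#include <assert.h>

typedef enum { PLAIN_LOAD, PLAIN_STORE, TILE_LOAD, TILE_STORE, AMX_OP } node_kind;

/* Toy IR node: 'operand' is the index of the value it uses, or -1.
 * row/col are meaningful for AMX_OP, TILE_LOAD and TILE_STORE nodes. */
typedef struct {
    node_kind kind;
    int operand;
    int row;
    int col;
} node;

/* Rooted at the AMX operations, push the shape to adjacent plain
 * loads and stores and turn them into tile loads and tile stores. */
void propagate_shapes(node *nodes, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int op = nodes[i].operand;
        if (op < 0 || op >= n)
            continue;
        if (nodes[i].kind == AMX_OP && nodes[op].kind == PLAIN_LOAD) {
            /* The AMX operation uses a plain load: the load gets its shape. */
            nodes[op].kind = TILE_LOAD;
            nodes[op].row = nodes[i].row;
            nodes[op].col = nodes[i].col;
        } else if (nodes[i].kind == PLAIN_STORE && nodes[op].kind == AMX_OP) {
            /* A plain store uses an AMX result: the store gets its shape. */
            nodes[i].kind = TILE_STORE;
            nodes[i].row = nodes[op].row;
            nodes[i].col = nodes[op].col;
        }
    }
}
```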

7.       Machine IR
Since the AMX intrinsics take the row and column as the input parameters, we can
create a pseudo instruction corresponding to it. The AMX intrinsics are lowered
to the pseudo AMX instruction which has extra row and column operands
corresponding to AMX intrinsic. The real AMX instructions don't need the row
and column operands. The row and column information should be configured by
ldtilecfg before executing any AMX instruction.

8.       Register allocation
An AMX register is special. It needs to be configured before use, and the config
instruction is expensive. To avoid unnecessary tile configuration, we collect as
much tile shape information as possible and combine it into one ldtilecfg
instruction. The ldtilecfg instruction should dominate every AMX instruction
that accesses a tile register. On the other hand, the ldtilecfg should
post-dominate the instructions that define the tile shape. Tile register
spilling should avoid re-configuration due to a different tile shape, so a
spilled register should be reloaded into a register that has the same tile
shape. Since tile register allocation is special and may allocate general
virtual registers to configure the tile registers, we can add a separate pass to
do it before the general register allocation pass. After register allocation the
tile shape information is not needed anymore, so we can transform the pseudo AMX
instructions into real AMX instructions by removing the row and column operands.

This seems complicated.

Reading through the documentation, there appears to be a single global tile
config for all tile registers at any time.

Why not simply model this tile config as a designated special register and the
tile instructions as having an implicit use of this register?  That would seem
to ensure that the register allocator has all the constraints needed.  You'd
need to teach it how to spill the special registers with the appropriate
instructions, but that seems a lot more straight forward?

9.       Use recommendation
Due to the shape configuration issue, we recommend that users define the tile
shapes at function entry and inline functions as much as possible. The AMX
instructions focus on computation instead of storage, so global variables for
tile data are not recommended.

Thanks
Yuanke








_______________________________________________

LLVM Developers mailing list

llvm-dev at lists.llvm.org

https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev







--

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory


Hal Finkel via llvm-dev

2020-Aug-24 09:02 UTC


[llvm-dev] Intel AMX programming model discussion.

Hi, Yuanke,

Thanks for writing this up. Let me back up a bit because the scheme I 
proposed last week doesn't work without further modification: within a 
particular "configuration region" (i.e., the code in between the 
LDTILECFG and the TILERELEASE (or next LDTILECFG)), each tile register 
can only be used with one shape, and in addition, no register can have 
its shape changed without zeroing out all of the tile registers. Thus, 
just using different register classes for the different shapes, as I had 
suggested, isn't sufficient to model the allocation requirements. That 
would not prevent the same register from essentially being assigned to 
differently-shaped virtual registers with non-overlapping live ranges 
within one configuration region.

Also, as you point out, when multiple non-static tile shapes are in use, 
if you use one register class for each shape, you would need different 
register classes for these too. Luckily, I don't think that using the 
separate register classes actually buys us anything, so please disregard 
that suggestion of mine. Use only one register class.

Once the configuration regions are identified, you'll know how many tile 
register shapes are required. If this number is greater than eight, then 
you'll need to cut the region (requiring all live tiles to be spilled 
and restored around each re-configuration point). After that, we'll 
assume that we have eight or fewer distinct shapes.
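
As a rough illustration of the cutting step, the C sketch below walks a straight-line sequence of tile instructions, each tagged with a shape id, and starts a new configuration region whenever a ninth distinct shape would be needed. It is a deliberately simplified model: cut_regions and the integer shape ids are hypothetical, control flow is ignored, and a real pass would also have to spill and reload live tiles at each cut.

```c
#include <assert.h>

#define MAX_TILE_REGS 8

/* Greedily split a linear sequence of tile instructions into regions that
 * each use at most MAX_TILE_REGS distinct shapes. shape_ids[i] is the shape
 * used by instruction i. region_start receives the index of the first
 * instruction of each region and must have room for n entries.
 * Returns the number of regions. */
int cut_regions(const int *shape_ids, int n, int *region_start) {
    assert(n >= 0);
    int regions = 0;
    int seen[MAX_TILE_REGS];
    int nseen = 0;
    for (int i = 0; i < n; i++) {
        int known = 0;
        for (int j = 0; j < nseen; j++) {
            if (seen[j] == shape_ids[i])
                known = 1;
        }
        if (i == 0 || (!known && nseen == MAX_TILE_REGS)) {
            /* Start a new region: a fresh LDTILECFG would be placed here. */
            region_start[regions++] = i;
            nseen = 0;
            known = 0;
        }
        if (!known)
            seen[nseen++] = shape_ids[i];
    }
    return regions;
}
```

For ten instructions that each use a different shape, this yields two regions, the second starting at the ninth instruction.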

Now the problem is that you need to allocate registers, satisfying all 
of the usual constraints (non-overlapping live ranges, etc.), but with 
an additional constraint: once a physical register has been used with 
some particular tile shape, it cannot be assigned to any other tile shape.

I think that the current infrastructure can support this as follows:

  1. Add an override of X86RegisterInfo::getRegAllocationHints. Like 
SystemZRegisterInfo::getRegAllocationHints does sometimes, when hinting 
the tile registers, the function will return true (to indicate a hard 
constraint). As registers are assigned in RegAllocGreedy, 
getRegAllocationHints is called for each virtual register. For virtual 
tile registers, look at the passed VirtRegMap, etc. for already-assigned 
tile virtual registers with different shape requirements than the current 
virtual register (you'll need to cache the shape requirements in 
X86MachineFunctionInfo for this to be efficient), and return a hints 
list consisting of all other non-reserved tile registers.

  2. To support RegAllocFast, which doesn't use getRegAllocationHints, 
you would need to make the configuration regions small enough that it 
doesn't matter (and if you're doing this around every tile instruction, 
this is automatically true).

  3. To support RegAllocPBQP (which is likely a good thing to do, but 
probably not required), I believe you can support this by adding custom 
constraints to the solver (kind of like what AArch64PBQPRegAlloc.cpp does).
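
The filtering that item 1 asks getRegAllocationHints to perform can be sketched outside of LLVM as follows. This C model is only an illustration of the constraint and not the LLVM API: tile_alloc_hints, NUM_TMM and NO_SHAPE are made-up names, and phys_shape stands in for whatever the VirtRegMap and the cached shape requirements would tell the real hook.

```c
#include <assert.h>

#define NUM_TMM 8
#define NO_SHAPE (-1)

/* phys_shape[r] is the shape id already bound to physical tile register r
 * in the current configuration region, or NO_SHAPE if r is still unused.
 * Writes to 'hints' every register that a virtual register of shape 'want'
 * may still use (unused, or already bound to the same shape) and returns
 * the number of hints. */
int tile_alloc_hints(const int phys_shape[NUM_TMM], int want,
                     int hints[NUM_TMM]) {
    assert(want != NO_SHAPE);
    int n = 0;
    for (int r = 0; r < NUM_TMM; r++) {
        if (phys_shape[r] == NO_SHAPE || phys_shape[r] == want)
            hints[n++] = r;
    }
    return n;
}
```

If the returned list is empty, the region needs more shapes than there are free registers, which is the case where the region has to be cut or a tile spilled.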

Once the allocation process is complete, you'll need to go back and 
update the LDTILECFG data to reflect the chosen shape -> register mapping.

What I don't know, however, is how well the getRegAllocationHints method 
will work. The benefit is that you don't need to write a custom 
pre-allocation allocator. The drawback is that it might visit the virtual 
registers to assign in a suboptimal order because it doesn't really 
understand the constraint being imposed (generally, we just assign 
larger live ranges first). Then again, it is a greedy algorithm, 
and if you want something systematically closer to optimal, maybe you 
should be using PBQP anyway. If you do end up needing a custom allocator 
for these, I recommend looking at the PBQP solver (which, as I recall, 
is independently reusable).

Hopefully, this is more-helpful advice.

  -Hal

On 8/21/20 9:54 PM, Luo, Yuanke wrote:
>
> It seems I made a mistake about sharing register units. Can tile 
> registers in different tile register classes (different register 
> classes have different tile shapes) share a register unit? Think 
> about two virtual tile registers, %2:vtile1x1 and %3:vtile1x2. First 
> %2 is allocated to $tmm0; after %2 is killed, %3 is allocated 
> to $tmm0. This is not allowed, because when $tmm0 is allocated to %2, 
> its shape is configured as 1x1. If we reallocate $tmm0 to %3, then we 
> need to re-configure $tmm0 to 1x2, which causes $tmm0~$tmm7 to be 
> clobbered.
>
> Yuanke
>
> From: Luo, Yuanke
> Sent: Friday, August 21, 2020 2:12 PM
> To: Hal Finkel <hfinkel at anl.gov>; Topper, Craig 
> <craig.topper at intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>;
> Philip Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
> florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
> Subject: RE: [llvm-dev] Intel AMX programming model discussion.
>
> Hi Hal,
>
> The proposal is attractive to me, but there is something I still can’t 
> figure out. Let’s take the MIR below as an example. We assume we have 256 
> register classes (vtile1x1, vtile1x2, …, vtile16x16).
>
> 1. After instruction selection, the pseudo AMX instructions are 
> generated. The names of the pseudo instructions have a ‘P’ prefix. At this 
> point all the AMX pseudo instructions take vtile as their register class. 
> Let’s assume %13 is the constant 3, %10 is the constant 4, and %14 is variable.
>
> /  %1:vtile = *P*TILELOADDV %13:gr16, %10:gr16, %17:gr64, 1, 
> %18:gr64_nosp, 0, $noreg/
>
> /  %2:vtile = *P*TILELOADDV %10:gr16, %14:gr16, %17:gr64, 1, 
> %18:gr64_nosp, 0, $noreg/
>
> /  %3:vtile = *P*TILELOADDV %13:gr16, %14:gr16, %17:gr64, 1, 
> %18:gr64_nosp, 0, $noreg/
>
> /%21:vtile = *P*TDPBSSDV %13:gr16, %10:gr16, %14:gr16, 
> %3:vtile(tied-def 0), %1:vtile, %2:vtile /
>
> 2. The configuration-placement pass looks at all of the AMX 
> pseudo-instructions and identifies regions in which the 
> pseudo-instructions use the same configuration parameters. It first 
> replaces the register class for all tile registers whose shape is 
> known at compile time. Since the shape of %1 is constant, it 
> replaces %1:vtile with %1:vtile3x4, which changes the register class and 
> morphs the pseudo instruction into the real AMX instruction. The shapes of 
> %2 and %3 are unknown at compile time, so it arbitrarily picks tile 
> register classes that have not been assigned before and assigns them 
> to %2 and %3. After register class allocation, the code is 
> transformed as follows. The register classes vtile1x1 and vtile1x2 are 
> allocated for %2 and %3.
>
> /*P*LDTILECFG/
>
> /  %1:vtile3x4  = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
> /  %2:vtile1x1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
> /  %3:vtile1x2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
> /%21:vtile1x2 = TDPBSSDV %9:vtile1x2(tied-def 0), %1:vtile3x4, 
> %2:vtile1x1 /
>
> Some things I have not figured out:
>
> a. I am not sure whether an AMX instruction’s inputs and outputs can fit 
> multiple register classes (vtile1x1, …, vtile16x16); otherwise we need 
> 256 pseudo instructions.
>
> b. Whether 256 register classes are enough to allocate from. There may be 
> more than 256 tile registers of unknown shape.
>
> c. In this pass we also find the proper point (common dominator) to 
> insert ldtilecfg, but at this time the registers are not yet allocated, so 
> we don’t know the shape of each physical tile register. So we just insert 
> a pseudo tile config instruction.
>
> 3. All tile register classes share the same register units. We do register 
> allocation with the existing framework, and the code is transformed as follows.
>
> /  $tmm0  = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
> /  $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
> /  $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
> /$tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1/
>
> 4. Run the config pass to collect the shape of each physical tile register 
> and configure them. The code can be generated as below. Here is the 
> problem: how can we know the shape of each physical tile register?
>
> */   MOV row, col info to %stack.0 for each physical tile register   
> ??????/*
>
> */  LDTILECFG %stack.0, 1, $noreg, 0, $noreg, implicit-def $tmm0, 
> implicit-def $tmm1, implicit-def $tmm2, implicit-def $tmm3, 
> implicit-def $tmm4, implicit-def $tmm5, implicit-def $tmm6, 
> implicit-def $tmm7/*
>
> /  $tmm0  = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
> /  $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
> /  $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
> /$tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1/
>
> Thanks
>
> Yuanke
>
> ...
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
