thr3ads.net - llvm dev - [llvm-dev] Intel AMX programming model discussion. [Aug 2020]


Luo, Yuanke via llvm-dev

2020-Aug-14 13:27 UTC

[llvm-dev] Intel AMX programming model discussion.

Hi,
Intel Advanced Matrix Extensions (Intel AMX) is a new programming paradigm
consisting of two components: a set of 2-dimensional registers (tiles)
representing sub-arrays from a larger 2-dimensional memory image, and
accelerators able to operate on tiles. The capabilities of an Intel AMX
implementation are enumerated by palettes. Two palettes are supported: palette 0
represents the initialized state, and palette 1 consists of 8 tile registers of
up to 1 KB each, controlled by a tile control register.
The instruction manual is posted at
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html.
The AMX ABI proposal is posted at
https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4.
This email is to discuss the programming model for AMX. Florian has introduced
the matrix type and intrinsics to the LLVM community. We'd like to adopt some
ideas from that work.
Here is what we propose for the AMX programming model.

1. Data type.
We'd like to have a fixed vector type for AMX. Since the shape of an AMX register
is configurable, the vector size is the maximum size of an AMX register, that
is, 1024 bytes.
The C code may look like this.
typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));
_tile_data tile;
And the LLVM IR may look like this.
@tile = dso_local local_unnamed_addr global <256 x i32> zeroinitializer, align 64
For LLVM IR, it would be nice to have a new type, x86_amxtile, that can be mapped
to AMX registers.
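As a quick sanity check on the sizes above, here is a minimal sketch, assuming a GCC- or Clang-compatible compiler with the vector_size extension and a 4-byte int. It confirms that the proposed type holds 256 int elements in 1024 bytes, which matches the largest palette 1 tile in the Intel reference (16 rows of 64 bytes). The names TILE_MAX_ROWS, TILE_MAX_COLSB, tile_elem_count and tile_max_bytes are introduced here for illustration only and are not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Type from the proposal: a 1024-byte vector of int, 64-byte aligned. */
typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));

/* Illustrative constants (not from the proposal): palette 1 tile limits. */
enum { TILE_MAX_ROWS = 16, TILE_MAX_COLSB = 64 };

/* Number of int elements in one tile-sized vector. */
static size_t tile_elem_count(void) {
    return sizeof(_tile_data) / sizeof(int);
}

/* Largest tile footprint in bytes under the assumed palette 1 limits. */
static size_t tile_max_bytes(void) {
    return (size_t)TILE_MAX_ROWS * (size_t)TILE_MAX_COLSB;
}
```

With a 4-byte int, tile_elem_count() matches the <256 x i32> in the IR above, and tile_max_bytes() equals sizeof(_tile_data).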

2. AMX Intrinsics.
The internal intrinsics map 1:1 to AMX instructions. The parameters m, n, and
k identify the shape of the tile. The shape can be variable, but it cannot
exceed the size that the AMX hardware supports. The compiler can deduce the shape
of each tile from the AMX intrinsics.
_tile_data _tile_loadd_internal(char m, short n, const void *base, int stride);
_tile_data _tile_dpbssd_internal(char m, short n, short k, _tile_data dst,
                                 _tile_data src1, _tile_data src2);
_tile_data _tile_dpbf16ps_internal(char m, short n, short k, _tile_data dst,
                                   _tile_data src1, _tile_data src2);
void _tile_stored_internal(char m, short n, void *base, int stride,
                           _tile_data tile);
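To make the arithmetic behind the dpbssd intrinsic concrete, here is a scalar reference sketch of a signed-by-signed int8 dot-product accumulate, following the TDPBSSD description in the Intel reference. The helper ref_dpbssd is hypothetical. It assumes m counts rows of the destination, n counts 32-bit columns of the destination, and k counts 4-byte groups per row of the first source. The proposal does not state the units of m, n, and k, so treat these as assumptions of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference for a signed-by-signed int8 dot-product accumulate.
 * Hypothetical helper, not part of the proposal.
 *   c: m x n     int32 accumulators, row-major
 *   a: m x (4*k) int8 values, row-major
 *   b: k x (4*n) int8 values, row-major; each row holds n groups of 4 bytes
 * For every (i, j): c[i][j] += sum over g, t of a[i][4g+t] * b[g][4j+t]. */
static void ref_dpbssd(int m, int n, int k,
                       int32_t *c, const int8_t *a, const int8_t *b) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int32_t acc = c[i * n + j];
            for (int g = 0; g < k; ++g) {
                for (int t = 0; t < 4; ++t) {
                    acc += (int32_t)a[i * 4 * k + 4 * g + t] *
                           (int32_t)b[g * 4 * n + 4 * j + t];
                }
            }
            c[i * n + j] = acc;
        }
    }
}
```

For example, with m = n = k = 1, a = {1, 2, 3, 4}, b = {5, 6, 7, 8} and an accumulator starting at 10, the result is 10 + 5 + 12 + 21 + 32 = 80.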

3. User interfaces.
The tile shape and tile data are combined into a struct in C. The shape
of a tile may be initialized only once. The user interface looks like
this.
#define __DEFAULT_FN_AMX \
  __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))

typedef struct __tile_str {
  const char row;
  const short col;
  _tile_data tile;
} __tile;

__DEFAULT_FN_AMX
void __tile_loadd(__tile *dst, const void *base, long stride) {
  dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
}

__DEFAULT_FN_AMX
void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
  /* m = src1.row, n = src2.col, k = src1.col */
  dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
                                    dst->tile, src1.tile, src2.tile);
}

__DEFAULT_FN_AMX
void __tile_stored(void *base, long stride, __tile src) {
  _tile_stored_internal(src.row, src.col, base, stride, src.tile);
}


4. Example code
The example shows how to use the user interface in a function.
/* buf, buf2, and STRIDE as implied by the LLVM IR in section 5 */
#define STRIDE 32
char buf[1024];
char buf2[1024];

void api(int cond, short row, short col) {
  __tile a = {row, col};
  __tile b = {row, col};
  __tile c = {row, col};

  if (cond) {
    __tile_loadd(&a, buf, STRIDE);
    __tile_loadd(&b, buf, STRIDE);
    __tile_loadd(&c, buf, STRIDE);
  } else {
    __tile_loadd(&a, buf2, STRIDE);
    __tile_loadd(&b, buf2, STRIDE);
    __tile_loadd(&c, buf2, STRIDE);
  }
  __tile_dpbsud(&c, a, b);
  __tile_stored(buf, STRIDE, c);
}

5. LLVM IR
The LLVM IR intrinsics take the row and column information as input
parameters, so that the compiler can deduce the shape of the tile data. The
remaining parameters are what the AMX instructions require. This is the LLVM IR
corresponding to the example code.
define dso_local void @api(i32 %cond, i16 signext %row, i16 signext %col) local_unnamed_addr #2 {
entry:
  %tobool = icmp eq i32 %cond, 0
  %sext = shl i16 %col, 8
  %conv.i31 = ashr exact i16 %sext, 8
  br i1 %tobool, label %if.else, label %if.then

if.then:                                          ; preds = %entry
  %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  br label %if.end

if.else:                                          ; preds = %entry
  %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
  %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
  %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
  %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31, i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3
  tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32, <256 x i32> %6) #3
  ret void
}

6. Shape propagation
In an -O0 build, the front end generates some general loads and stores for tile
vectors. We need to start from the AMX intrinsics and propagate the shape
information to the virtual tile registers. If an AMX intrinsic uses the result
of a load instruction, the shape is propagated to the load and the load is
transformed into a tile load intrinsic. If a store instruction uses the result
of an AMX intrinsic, the shape is propagated to the store and the store is
transformed into a tile store intrinsic.
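The propagation rule can be sketched on a toy data structure. This is only a minimal model under stated assumptions: the node type, its fields, and propagate_shapes are hypothetical and far simpler than LLVM IR (each node has at most one operand, and shapes are plain integers). It is meant to show the direction of propagation, not an actual pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature IR, for illustration only. */
enum node_kind { NODE_LOAD, NODE_STORE, NODE_AMX };

struct node {
    enum node_kind kind;
    int operand;     /* index of the node whose value is used, or -1 */
    int rows, cols;  /* shape; 0 means "not known yet" */
    int is_tile_op;  /* 1 once the node is (or has become) a tile operation */
};

/* Propagate shapes outward from AMX intrinsic nodes:
 *  - a plain load whose result feeds an AMX node takes that node's shape
 *    and is marked as a tile load;
 *  - a plain store that consumes an AMX node's result takes that node's
 *    shape and is marked as a tile store.
 * Loads and stores that never touch an AMX node are left alone. */
static void propagate_shapes(struct node *n, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        int op = n[i].operand;
        if (op < 0 || (size_t)op >= count)
            continue;
        if (n[i].kind == NODE_AMX && n[op].kind == NODE_LOAD) {
            n[op].rows = n[i].rows;
            n[op].cols = n[i].cols;
            n[op].is_tile_op = 1;
        } else if (n[i].kind == NODE_STORE && n[op].kind == NODE_AMX) {
            n[i].rows = n[op].rows;
            n[i].cols = n[op].cols;
            n[i].is_tile_op = 1;
        }
    }
}
```

In this model, a load, AMX, store chain ends up with all three nodes carrying the AMX node's shape, while an unrelated load/store pair keeps is_tile_op at 0.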

7. Machine IR
Since the AMX intrinsics take the row and column as input parameters, we can
create a pseudo instruction corresponding to each of them. The AMX intrinsics
are lowered to pseudo AMX instructions, which carry extra row and column
operands corresponding to those of the intrinsic. The real AMX instructions
don't need the row and column operands. The row and column information should
be configured by ldtilecfg before any AMX instruction executes.
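For reference, the memory image that ldtilecfg reads is a 64-byte block. The sketch below lays it out as a C struct following the format in the Intel instruction set extensions reference. The struct and field names and the helpers tile_config_init and tile_config_set are illustrative assumptions, not part of the proposal, and nothing here executes an AMX instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-byte image consumed by ldtilecfg. Names are illustrative. */
struct tile_config {
    uint8_t  palette_id;     /* byte 0 */
    uint8_t  start_row;      /* byte 1 */
    uint8_t  reserved0[14];  /* bytes 2..15, must be zero */
    uint16_t colsb[16];      /* bytes 16..47, bytes per row of each tile */
    uint8_t  rows[16];       /* bytes 48..63, number of rows of each tile */
};

_Static_assert(sizeof(struct tile_config) == 64, "config image is 64 bytes");

/* Start from an all-zero image (palette 0, no tiles configured). */
static void tile_config_init(struct tile_config *cfg) {
    memset(cfg, 0, sizeof *cfg);
}

/* Record the shape of tile register idx (0..7 under palette 1).
 * Returns 0 on success, -1 if the index or shape is out of range. */
static int tile_config_set(struct tile_config *cfg, unsigned idx,
                           unsigned rows, unsigned colsb) {
    if (idx >= 8 || rows == 0 || rows > 16 || colsb == 0 || colsb > 64)
        return -1;
    cfg->palette_id = 1;
    cfg->rows[idx] = (uint8_t)rows;
    cfg->colsb[idx] = (uint16_t)colsb;
    return 0;
}
```

A pass that knows the row and column operands of every pseudo instruction could fill such an image once and hand its address to ldtilecfg; the real instructions then run without shape operands.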

8. Register allocation
AMX registers are special. They need to be configured before use, and the config
instruction is expensive. To avoid unnecessary tile configuration, we collect as
much tile shape information as possible and combine it into one ldtilecfg
instruction. The ldtilecfg instruction should dominate every AMX instruction
that accesses a tile register. On the other hand, the ldtilecfg should
post-dominate the instructions that define the tile shapes. Tile register
spilling should avoid re-configuration due to a different tile shape, so a
spilled register should be reloaded into a register that shares the same tile
shape. Since tile register allocation is special and may allocate general
virtual registers to configure the tile registers, we can add a separate pass to
do it before the general register allocation pass. After register allocation,
the tile shape information is not needed anymore, so we can transform the pseudo
AMX instructions into real AMX instructions by removing the row and column
operands.
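The reload constraint (a spilled tile must come back into a register that already has the same shape) can be illustrated with a small lookup. This is a sketch under assumptions: tile_shape, pick_reload_reg, and the busy-mask representation are hypothetical and do not describe the actual allocator.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_TILE_REGS = 8 };

/* Shape as it would be recorded in the single tile configuration. */
struct tile_shape {
    uint8_t  rows;
    uint16_t colsb;
};

/* Hypothetical helper: choose a physical tile register for reloading a
 * spilled value without changing the configuration.
 *   configured[i]  shape already configured for register i
 *   busy_mask      bit i is set when register i holds a live value
 *   want           shape of the spilled value
 * Returns the register index, or -1 if no free register has that shape
 * (in which case a re-configuration could not be avoided). */
static int pick_reload_reg(const struct tile_shape configured[NUM_TILE_REGS],
                           unsigned busy_mask, struct tile_shape want) {
    for (int i = 0; i < NUM_TILE_REGS; ++i) {
        if (busy_mask & (1u << i))
            continue;
        if (configured[i].rows == want.rows &&
            configured[i].colsb == want.colsb)
            return i;
    }
    return -1;
}
```

A return value of -1 marks exactly the case the proposal wants to avoid, which is why shapes are gathered up front and assigned before general register allocation.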

9. Use recommendation
Due to the shape configuration issue, we recommend that users define the tile
shape at the function entry and inline functions as much as possible. The AMX
instructions focus on computation rather than storage, so global variables for
tile data are not recommended.

Thanks
Yuanke

Hal Finkel via llvm-dev

2020-Aug-14 15:26 UTC

[llvm-dev] Intel AMX programming model discussion.

On 8/14/20 8:27 AM, Luo, Yuanke via llvm-dev wrote:
> 3. User interfaces.
>
> The tile shape and tile data are combined into a struct in C language.
> The shape of the tile is only allowed to be initialized once. The user
> interface looks as this.
>
> #define __DEFAULT_FN_AMX \
>   __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))
>
> typedef struct __tile_str {
>   const char row;
>   const short col;
>   _tile_data tile;
> } __tile;
>
This interface looks convenient, but what happens if one of these types
appears on a function-call boundary? Does this force everything to be
spilled to and restored from the stack? Maybe this type needs some
additional attribute to give it a custom register-passing convention?

> 8.Register allocation
>
> AMX register is special. It needs to be configured before use and the 
> config instruction is expensive. To avoid unnecessary tile configure, 
> we collect the tile shape information as much as possible and combine 
> them into one ldtilecfg instruction. The ldtilecfg instruction should 
> dominate any AMX instruction that access tile register. On the other 
> side, the ldtilecfg should post-dominated the instruction that define 
> the tile shape. For tile register spill, it should avoid re-config due 
> to the different tile shape, the spilled register should be reloaded 
> to the register that share the same tile shape. Since tile register 
> allocation is special and it may allocate general virtual register to 
> configure tile register, we can add a sperate pass to do it before 
> general register allocation pass. After register allocation, the tile 
> shape information is not needed anymore, so we can transform the 
> pseudo AMX instruction to real AMX instruction by removing the row and 
> column operands.
>
Can you take advantage of our IPRA capability so that internal function 
calls might avoid this reconfiguration if the necessary configuration is 
always done in the caller?


How will the implementation of __builtin_setjmp/longjmp be affected?


Thanks again,

Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Philip Reames via llvm-dev

2020-Aug-14 17:17 UTC

[llvm-dev] Intel AMX programming model discussion.

On 8/14/20 6:27 AM, Luo, Yuanke via llvm-dev wrote:
> 8.Register allocation
>
> AMX register is special. It needs to be configured before use and the 
> config instruction is expensive. To avoid unnecessary tile configure, 
> we collect the tile shape information as much as possible and combine 
> them into one ldtilecfg instruction. The ldtilecfg instruction should 
> dominate any AMX instruction that access tile register. On the other 
> side, the ldtilecfg should post-dominated the instruction that define 
> the tile shape. For tile register spill, it should avoid re-config due 
> to the different tile shape, the spilled register should be reloaded 
> to the register that share the same tile shape. Since tile register 
> allocation is special and it may allocate general virtual register to 
> configure tile register, we can add a sperate pass to do it before 
> general register allocation pass. After register allocation, the tile 
> shape information is not needed anymore, so we can transform the 
> pseudo AMX instruction to real AMX instruction by removing the row and 
> column operands.
>
This seems complicated.

Reading through the documentation, there appears to be a single global
tile config for all tile registers at any time.

Why not simply model this tile config as a designated special register
and the tile instructions as having an implicit use of this register?
That would seem to ensure that the register allocator has all the
constraints needed. You'd need to teach it how to spill the special
registers with the appropriate instructions, but that seems a lot more
straightforward?

Luo, Yuanke via llvm-dev

2020-Aug-14 23:39 UTC

[llvm-dev] Intel AMX programming model discussion.

From: Hal Finkel <hfinkel at anl.gov>
Sent: Friday, August 14, 2020 11:27 PM
To: Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at
intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.

On 8/14/20 8:27 AM, Luo, Yuanke via llvm-dev wrote:
3.       User interfaces.
The tile shape and tile data are combined into a struct in C. The shape of a
tile may only be initialized once. The user interface looks like this.
#define __DEFAULT_FN_AMX    \
  __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))
typedef struct __tile_str {
  const char row;
  const short col;
  _tile_data tile;
} __tile;

[Hal] This interface looks convenient, but what happens if one of these types
appears on a function-call boundary? Does this force everything to be spilled
and restored from the stack? Maybe this type needs some additional attribute to
give it a custom register-passing convention?

[Yuanke] We prefer that tile data be passed through memory across function
calls, because passing it through registers is not as efficient as passing it
through memory. The compiler allocates the tile registers and configures them in
the callee; that re-configuration clears all the tile data registers to zero.
So yes, this forces everything to be spilled and restored from the stack.

__DEFAULT_FN_AMX
void __tile_loadd(__tile *dst, const void *base, long stride) {
  dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
}

__DEFAULT_FN_AMX
void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
  dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
                                    dst->tile, src1.tile, src2.tile);
}

__DEFAULT_FN_AMX
void __tile_stored(void *base, long stride, __tile src) {
  _tile_stored_internal(src.row, src.col, base, stride, src.tile);
}
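For readers who want to see what the dot-product wrapper computes, here is a scalar reference model of the signed-by-signed byte dot product (the wrapper is named __tile_dpbsud but forwards to _tile_dpbssd_internal, so this models the dpbssd form). It is a sketch under assumptions: dimensions are counted in 32-bit elements rather than bytes, tiles are dense row-major arrays, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of dst += src1 * src2 for signed 8-bit inputs.
   dst: m rows of n int32 accumulators.
   a:   m rows of 4*k signed bytes.
   b:   k rows of 4*n signed bytes.
   Each accumulator adds, for every p in [0, k), the sum of four byte
   products taken from a[i][4p..4p+3] and b[p][4j..4j+3]. */
static void dpbssd_reference(int m, int n, int k,
                             int32_t *dst, const int8_t *a, const int8_t *b) {
  assert(m > 0 && n > 0 && k > 0);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      int32_t acc = dst[i * n + j];
      for (int p = 0; p < k; ++p) {
        for (int q = 0; q < 4; ++q) {
          acc += (int32_t)a[i * 4 * k + 4 * p + q] *
                 (int32_t)b[p * 4 * n + 4 * j + q];
        }
      }
      dst[i * n + j] = acc;
    }
  }
}
```

Note that the second operand is not a plain row-major K x N byte matrix: groups of four bytes along K are packed next to each other within a row.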

4.       Example code
The example shows how to use the user interface in a function.
void api(int cond, short row, short col) {
  __tile a = {row, col};
  __tile b = {row, col};
  __tile c = {row, col};

  if(cond) {
    __tile_loadd(&a, buf, STRIDE);
    __tile_loadd(&b, buf, STRIDE);
    __tile_loadd(&c, buf, STRIDE);
  } else {
    __tile_loadd(&a, buf2, STRIDE);
    __tile_loadd(&b, buf2, STRIDE);
    __tile_loadd(&c, buf2, STRIDE);
  }
  __tile_dpbsud(&c, a, b);
  __tile_stored(buf, STRIDE, c);
}

5.       LLVM IR
The LLVM IR intrinsics take the row and column information as input
parameters, so that the compiler can deduce the shape of the tile data. The
remaining parameters are what the AMX instructions require. This is the LLVM IR
corresponding to the example code.
define dso_local void @api(i32 %cond, i16 signext %row, i16 signext %col) local_unnamed_addr #2 {
entry:
  %tobool = icmp eq i32 %cond, 0
  %sext = shl i16 %col, 8
  %conv.i31 = ashr exact i16 %sext, 8
  br i1 %tobool, label %if.else, label %if.then

if.then:                                          ; preds = %entry
  %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  br label %if.end

if.else:                                          ; preds = %entry
  %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
  %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
  %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
  %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31, i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3
  tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32, <256 x i32> %6) #3
  ret void
}

6.       Shape propagation
In an -O0 build, the front end generates some generic loads and stores of the
tile vector. We need to start from the AMX intrinsics and propagate the shape
information to the virtual tile registers. If an AMX intrinsic uses the result
of a load instruction, the shape is propagated to the load and the load is
transformed into a tile load intrinsic. If a store instruction stores the
result of an AMX intrinsic, the shape is propagated to the store and the store
is transformed into a tile store intrinsic.
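The propagation rule can be sketched on a toy instruction list. This is not LLVM code: the node type, the opcode names, and the single-operand simplification are all made up for illustration, and each AMX node is assumed to carry one shape that applies to its operand and its result.

```c
#include <assert.h>
#include <stddef.h>

/* Toy opcodes: generic load/store, an AMX intrinsic, and the tile
   load/store forms that generic accesses are rewritten into. */
typedef enum { OP_LOAD, OP_STORE, OP_AMX, OP_TILELOAD, OP_TILESTORE } op_kind;

typedef struct {
  op_kind kind;
  int operand;     /* index of the node whose value is used, or -1 */
  int rows, cols;  /* shape; 0 means unknown */
} node;

/* Walk the list once and push shapes outward from the AMX intrinsics:
   a generic load feeding an AMX intrinsic becomes a tile load, and a generic
   store of an AMX intrinsic result becomes a tile store. */
static void propagate_shapes(node *nodes, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    node *n = &nodes[i];
    if (n->operand < 0) continue;
    assert((size_t)n->operand < count);
    node *def = &nodes[n->operand];
    if (n->kind == OP_AMX && def->kind == OP_LOAD) {
      def->kind = OP_TILELOAD;
      def->rows = n->rows;
      def->cols = n->cols;
    } else if (n->kind == OP_STORE && def->kind == OP_AMX) {
      n->kind = OP_TILESTORE;
      n->rows = def->rows;
      n->cols = def->cols;
    }
  }
}
```

Loads and stores that never touch an AMX intrinsic are left alone, which matches the idea that only tile-typed traffic is rewritten.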

7.       Machine IR
Since the AMX intrinsics take the row and column as input parameters, we can
create a pseudo instruction corresponding to each of them. The AMX intrinsics
are lowered to pseudo AMX instructions, which carry the extra row and column
operands of the corresponding intrinsic. The real AMX instructions don't need
the row and column operands; the row and column information must be configured
by ldtilecfg before any AMX instruction executes.

8.       Register allocation
AMX registers are special. They need to be configured before use, and the
configuration instruction is expensive. To avoid unnecessary tile
configuration, we collect as much tile shape information as possible and
combine it into one ldtilecfg instruction. The ldtilecfg instruction should
dominate every AMX instruction that accesses a tile register. On the other
hand, the ldtilecfg should post-dominate the instructions that define the tile
shapes. When a tile register is spilled, re-configuration due to a different
tile shape should be avoided: the spilled register should be reloaded into a
register that has the same tile shape. Because tile register allocation is
special and may need general-purpose virtual registers to configure the tile
registers, we can add a separate pass that runs before the general register
allocation pass. After register allocation, the tile shape information is no
longer needed, so we can transform the pseudo AMX instructions into real AMX
instructions by removing the row and column operands.
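The reload constraint can be sketched as a lookup over the currently configured physical tile registers. This is only an illustration of the rule "reload into a register with the same shape"; the types and the function name are hypothetical and do not correspond to anything in LLVM's register allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_TILE_REGS = 8 };

typedef struct { int rows, cols; } tile_shape;

/* Snapshot of the physical tile registers under the current ldtilecfg. */
typedef struct {
  tile_shape shape[NUM_TILE_REGS];  /* configured shape of each register */
  bool busy[NUM_TILE_REGS];         /* true if the register holds a live value */
} tile_regfile;

/* Pick a free physical register whose configured shape matches `want`.
   Returns the register index, or -1 if none matches, in which case a reload
   would require a new (expensive) configuration. */
static int pick_reload_reg(const tile_regfile *rf, tile_shape want) {
  assert(rf != NULL);
  for (int r = 0; r < NUM_TILE_REGS; ++r) {
    if (!rf->busy[r] &&
        rf->shape[r].rows == want.rows && rf->shape[r].cols == want.cols) {
      return r;
    }
  }
  return -1;
}
```

A return value of -1 is the case the proposal tries to avoid, since re-configuring clears every tile data register.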

[Hal] Can you take advantage of our IPRA capability so that internal function
calls might avoid this reconfiguration if the necessary configuration is always
done in the caller?

[Yuanke] I'm not familiar with the IPRA capability, and I'm very interested in
it. Could you post some links that introduce IPRA?

[Hal] How will the implementation of __builtin_setjmp/longjmp be affected?

[Yuanke] That depends on the ABI. We propose that all tile registers be
caller-saved, so I think setjmp/longjmp is not affected.

Thanks again,

Hal

9.       Use recommendation
Because of the shape configuration issue, we recommend that users define the
tile shape at the function entry and inline functions as much as possible. The
AMX instructions focus on computation rather than storage, so global variables
for tile data are not recommended.

Thanks
Yuanke

_______________________________________________

LLVM Developers mailing list

llvm-dev at lists.llvm.org

https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory

Luo, Yuanke via llvm-dev

2020-Aug-14 23:49 UTC

[llvm-dev] Intel AMX programming model discussion.

[Yuanke] AMX registers are special. They need to be configured before use, and
the configuration instruction is expensive. To avoid unnecessary tile
configuration, we collect as much tile shape information as possible and
combine it into one ldtilecfg instruction. The ldtilecfg instruction should
dominate every AMX instruction that accesses a tile register. On the other
hand, the ldtilecfg should post-dominate the instructions that define the tile
shapes. When a tile register is spilled, re-configuration due to a different
tile shape should be avoided: the spilled register should be reloaded into a
register that has the same tile shape. Because tile register allocation is
special and may need general-purpose virtual registers to configure the tile
registers, we can add a separate pass that runs before the general register
allocation pass. After register allocation, the tile shape information is no
longer needed, so we can transform the pseudo AMX instructions into real AMX
instructions by removing the row and column operands.

[Philip]

This seems complicated.

Reading through the documentation, there appears to be a single global tile
config for all tile registers at any time.

Why not simply model this tile config as a designated special register and the
tile instructions as having an implicit use of this register?  That would seem
to ensure that the register allocator has all the constraints needed.  You'd
need to teach it how to spill the special registers with the appropriate
instructions, but that seems a lot more straightforward?

[Yuanke] In that case users would need to configure the tile registers
themselves. Spilling the configuration register is very expensive, because it
clears all the tile data registers to zero. In our proposal, the compiler is
responsible for deducing the shape of each virtual tile data register,
allocating physical registers for them, and then configuring those physical
registers. We could build the dependency you proposed and use it as a machine
IR check to ensure that a tile data register is configured before use.

From: Philip Reames <listmail at philipreames.com>
Sent: Saturday, August 15, 2020 1:17 AM
To: Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at
intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.




Thomas Raoux via llvm-dev

2020-Aug-20 09:08 UTC

[llvm-dev] Intel AMX programming model discussion.

Hi Yuanke,

This is quite interesting. Have you given any thought to how this
extension will be exposed besides going through Clang? There has been
a lot of work recently on the MLIR side to represent multidimensional
vectors. Nicolas Vasilache presented it in the open design meeting
last week (slides:
https://drive.google.com/file/d/1_zPPxOILAIHOWoSM7GALwioYOGEgD2Xe/view,
recording:
https://drive.google.com/file/d/13jY4GTe7ZjFxqh3TCMBUh15HWoSGcswj/view).

It would be great to have MLIR also target AMX in the future. Looking
at the design I think a lot of it would match well with the direction
MLIR has taken. One thing that is not supported at the time - even
though it has been discussed - is dynamic vector size. Do you expect
this to be a common use case or is it supported for completeness?

It would be great to hear your thoughts on how AMX could be targeted
by MLIR if you have looked at it at all.

Thanks,
Thomas



Luo, Yuanke via llvm-dev

2020-Aug-23 10:04 UTC

[llvm-dev] Intel AMX programming model discussion.

Hi Thomas,

I'm not familiar with MLIR. I read the links you pointed to, and I think we
could target AMX by lowering vector.transfer_read, vector.transfer_write, and
vector.contract to AMX intrinsics. Although the AMX intrinsics accept both
dynamic and fixed vector sizes, a fixed size is friendlier to register
allocation in LLVM. We'd like to work with the community to bring AMX to MLIR.

Thanks
Yuanke

-----Original Message-----
From: Thomas Raoux <thomasraoux at google.com> 
Sent: Thursday, August 20, 2020 5:08 PM
To: Luo, Yuanke <yuanke.luo at intel.com>
Cc: llvm-dev at lists.llvm.org; florian_hahn at apple.com; Kaylor, Andrew
<andrew.kaylor at intel.com>; Topper, Craig <craig.topper at
intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.

Hi Yuanke,

This is quite interesting. Did you put some thoughts on how this extension will
be exposed besides going through Clang? There has been a lot of work recently on
the MLIR side to represent multidimensional vectors. Nicolas Vasilache presented
it in the open design meeting last week (slides:
https://drive.google.com/file/d/1_zPPxOILAIHOWoSM7GALwioYOGEgD2Xe/view,
recording:
https://drive.google.com/file/d/13jY4GTe7ZjFxqh3TCMBUh15HWoSGcswj/view).

It would be great to have MLIR also target AMX in the future. Looking at the
design I think a lot of it would match well with the direction MLIR has taken.
One thing that is not supported at the time - even though it has been discussed
- is dynamic vector size. Do you expect this to be a common use case or is it
supported for completeness?

It would be great to hear your thoughts on how AMX could be targeted by MLIR if
you have looked at it at all.

Thanks,
Thomas


On Fri, Aug 14, 2020 at 6:27 AM Luo, Yuanke via llvm-dev <llvm-dev at
lists.llvm.org> wrote:>
> Hi,
>
> Intel Advanced Matrix Extensions (Intel AMX) is a new programming paradigm
> consisting of two components: a set of 2-dimensional registers (tiles)
> representing sub-arrays from a larger 2-dimensional memory image, and
> accelerators able to operate on tiles. The capabilities of an Intel AMX
> implementation are enumerated by palettes. Two palettes are supported:
> palette 0 represents the initialized state and palette 1 consists of 8 tile
> registers of up to 1 KB each, controlled by a tile control register.
>
> The instruction manual is posted at
> https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html.
>
> The AMX ABI proposal is posted at
> https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4.
>
> This email is to discuss the programming model for AMX. Florian has
> introduced the matrix type and intrinsics to the LLVM community. We’d like to
> adopt some ideas from it.
>
> Here is what we propose for the AMX programming model.
>
> 1.        Data type.
>
> We’d like to have a fixed vector type for AMX. Since the shape of an AMX
> register is configurable, the vector size is the maximum size of an AMX
> register, that is, 1024 bytes.
>
> The C code may look like this.
>
> typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));
>
> _tile_data tile;
>
> And the LLVM IR may look like this.
>
> @tile = dso_local local_unnamed_addr global <256 x i32> zeroinitializer, align 64
>
> For LLVM IR, it would be nice to have a new type, x86_amxtile, that can be
> mapped to AMX registers.
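The size arithmetic in the data type section can be checked with a small sketch. It assumes a GCC- or Clang-compatible compiler (for the vector extension) and a 4-byte int; the constant names TILE_BYTES, TILE_DWORDS, and TILE_FILE_BYTES are introduced here for illustration and are not part of the proposal.

```c
/* The proposed tile type: a 1024-byte vector of int (GCC/Clang extension). */
typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));

/* Illustrative constants; these names are not part of the proposal. */
enum {
  TILE_BYTES  = sizeof(_tile_data),               /* maximum tile size      */
  TILE_DWORDS = sizeof(_tile_data) / sizeof(int)  /* lanes in <256 x i32>   */
};

/* Palette 1 has 8 tile registers, so the full tile register file is 8 KB. */
enum { TILE_FILE_BYTES = 8 * TILE_BYTES };
```

With a 4-byte int, TILE_DWORDS comes out to 256, which is why the global in the IR above has type <256 x i32>.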
>
> 2.       AMX Intrinsics.
>
> The internal intrinsics map 1:1 to AMX instructions. The parameters m, n, and
> k identify the shape of the tile. The shape can be variable, but it cannot
> exceed the size that the AMX hardware supports. The compiler can deduce the
> shape of the tile from the AMX intrinsics.
>
> _tile_data _tile_loadd_internal(char m, short n, const void *base, int stride);
>
> _tile_data _tile_dpbssd_internal(char m, short n, short k, _tile_data dst,
>                                  _tile_data src1, _tile_data src2);
>
> _tile_data _tile_dpbf16ps_internal(char m, short n, short k, _tile_data dst,
>                                    _tile_data src1, _tile_data src2);
>
> void _tile_stored_internal(char m, short n, void *base, int stride,
>                            _tile_data tile);
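To make the roles of m, n, and k concrete, here is a scalar C model of the signed-byte dot product that the dpbssd intrinsic stands for, written from my reading of the instruction reference. It is a sketch, not the proposed implementation: the name tdpbssd_ref is hypothetical, and M, N, and K are counted in 32-bit elements here, which is an assumption of this sketch since the proposal does not say whether n and k are elements or bytes.

```c
#include <stdint.h>

/* Scalar reference model of the signed-byte dot product (TDPBSSD).
 * Assumption of this sketch: M, N, K are counted in 32-bit elements.
 *   c: M x N   int32 accumulators, row-major
 *   a: M x 4K  int8 values, i.e. K dwords per row
 *   b: K x 4N  int8 values, i.e. N dwords per row
 * Each dword of a and b holds four int8 values that are multiplied
 * pairwise and summed into the matching int32 accumulator. */
static void tdpbssd_ref(int M, int N, int K,
                        int32_t *c, const int8_t *a, const int8_t *b) {
  for (int m = 0; m < M; ++m)
    for (int k = 0; k < K; ++k)
      for (int n = 0; n < N; ++n)
        for (int i = 0; i < 4; ++i)
          c[m * N + n] += (int32_t)a[(m * K + k) * 4 + i] *
                          (int32_t)b[(k * N + n) * 4 + i];
}
```

Under this reading, the destination shape depends on m and n, the first source on m and k, and the second source on k and n, so the three shape parameters are enough for the compiler to recover the shape of every tile operand.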
>
> 3.       User interfaces.
>
> The tile shape and tile data are combined into a struct in C. The shape of
> the tile may be initialized only once. The user interface looks like this.
>
> #define __DEFAULT_FN_AMX    \
>   __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))
>
> typedef struct __tile_str {
>   const char row;
>   const short col;
>   _tile_data tile;
> } __tile;
>
> __DEFAULT_FN_AMX
> void __tile_loadd(__tile *dst, const void *base, long stride) {
>   dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
> }
>
> __DEFAULT_FN_AMX
> void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
>   dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
>                                     dst->tile, src1.tile, src2.tile);
> }
>
> __DEFAULT_FN_AMX
> void __tile_stored(void *base, long stride, __tile src) {
>   _tile_stored_internal(src.row, src.col, base, stride, src.tile);
> }
>
> 4.       Example code
>
> The example shows how to use the user interface in a function.
>
> void api(int cond, short row, short col) {
>   __tile a = {row, col};
>   __tile b = {row, col};
>   __tile c = {row, col};
>
>   if(cond) {
>     __tile_loadd(&a, buf, STRIDE);
>     __tile_loadd(&b, buf, STRIDE);
>     __tile_loadd(&c, buf, STRIDE);
>   } else {
>     __tile_loadd(&a, buf2, STRIDE);
>     __tile_loadd(&b, buf2, STRIDE);
>     __tile_loadd(&c, buf2, STRIDE);
>   }
>   __tile_dpbsud(&c, a, b);
>   __tile_stored(buf, STRIDE, c);
> }
>
> 5.       LLVM IR
>
> The LLVM IR intrinsics take the row and column information as input
> parameters, so that the compiler can deduce the shape of the tile data. The
> remaining parameters are what the AMX instructions require. This is the LLVM
> IR corresponding to the example code.
>
> define dso_local void @api(i32 %cond, i16 signext %row, i16 signext %col) local_unnamed_addr #2 {
> entry:
>   %tobool = icmp eq i32 %cond, 0
>   %sext = shl i16 %col, 8
>   %conv.i31 = ashr exact i16 %sext, 8
>   br i1 %tobool, label %if.else, label %if.then
>
> if.then:                                          ; preds = %entry
>   %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>        i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>        i64 32) #3
>   %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>        i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>        i64 32) #3
>   %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>        i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>        i64 32) #3
>   br label %if.end
>
> if.else:                                          ; preds = %entry
>   %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>        i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>        i64 32) #3
>   %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>        i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>        i64 32) #3
>   %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>        i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>        i64 32) #3
>   br label %if.end
>
> if.end:                                           ; preds = %if.else, %if.then
>   %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
>   %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
>   %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
>   %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31,
>        i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0,
>        <256 x i32> %b.sroa.1068.0) #3
>   tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31,
>        i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>        i64 32, <256 x i32> %6) #3
>   ret void
> }
>
> 6.       Shape propagation
>
> In an -O0 build, the front end generates some general loads and stores of
> the tile vector. Starting from the AMX intrinsics, we need to propagate the
> shape information to the virtual tile registers. If an AMX intrinsic uses
> the result of a load instruction, the shape is propagated to the load and
> the load is transformed into a tile load intrinsic. If a store instruction
> uses the result of an AMX intrinsic, the shape is propagated to the store
> and the store is transformed into a tile store intrinsic.
>
> 7.       Machine IR
>
> Since the AMX intrinsics take the row and column as input parameters, we can
> create a pseudo instruction corresponding to each of them. The AMX
> intrinsics are lowered to pseudo AMX instructions, which carry the extra row
> and column operands of the corresponding intrinsic. The real AMX
> instructions don’t need the row and column operands. The row and column
> information should be configured by ldtilecfg before executing any AMX
> instruction.
>
> 8.       Register allocation
>
> AMX registers are special: they need to be configured before use, and the
> config instruction is expensive. To avoid unnecessary tile configuration, we
> collect as much tile shape information as possible and combine it into one
> ldtilecfg instruction. The ldtilecfg instruction should dominate any AMX
> instruction that accesses a tile register. On the other hand, the ldtilecfg
> should post-dominate the instructions that define the tile shape. To avoid
> reconfiguration when a tile register is spilled, the spilled register should
> be reloaded into a register that has the same tile shape. Since tile
> register allocation is special and may allocate general virtual registers to
> configure the tile registers, we can add a separate pass that runs before
> the general register allocation pass. After register allocation, the tile
> shape information is no longer needed, so we can transform each pseudo AMX
> instruction into the real AMX instruction by removing the row and column
> operands.
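For readers less familiar with ldtilecfg, the following sketch shows the kind of data a single combined configuration holds. The 64-byte layout follows the tile configuration format as I understand it from the Intel instruction set reference for palette 1 (8 tiles, at most 16 rows of 64 bytes each); the type and helper names (tile_config, tile_config_init, tile_config_set_shape) are hypothetical and only illustrate collecting several shapes into one block.

```c
#include <stdint.h>
#include <string.h>

/* 64-byte memory block read by LDTILECFG (layout assumed from the ISA
 * reference; the type and helper names below are illustrative only). */
typedef struct {
  uint8_t  palette_id;    /* byte 0: 1 selects palette 1               */
  uint8_t  start_row;     /* byte 1: restart row for loads and stores  */
  uint8_t  reserved[14];  /* bytes 2-15: must be zero                  */
  uint16_t colsb[16];     /* bytes 16-47: bytes per row, per tile      */
  uint8_t  rows[16];      /* bytes 48-63: number of rows, per tile     */
} tile_config;

_Static_assert(sizeof(tile_config) == 64, "tile config must be 64 bytes");

/* Palette 1 limits: 8 tiles, each at most 16 rows x 64 bytes = 1 KB. */
enum { NUM_TILES = 8, MAX_ROWS = 16, MAX_COLSB = 64 };

/* Start from an all-zero block with palette 1 selected. */
static void tile_config_init(tile_config *cfg) {
  memset(cfg, 0, sizeof(*cfg));
  cfg->palette_id = 1;
}

/* Record the shape of one tile; returns 0 on success, -1 if the tile
 * index or shape exceeds what palette 1 supports. */
static int tile_config_set_shape(tile_config *cfg, int tile,
                                 int rows, int colsb) {
  if (tile < 0 || tile >= NUM_TILES) return -1;
  if (rows < 1 || rows > MAX_ROWS || colsb < 1 || colsb > MAX_COLSB) return -1;
  cfg->rows[tile] = (uint8_t)rows;
  cfg->colsb[tile] = (uint16_t)colsb;
  return 0;
}
```

Because one block carries the shapes of all eight tiles, a pass that has collected every shape used in a function can fill the block once and issue a single ldtilecfg, which is the situation the dominance conditions above are meant to guarantee.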
>
> 9.       Use recommendation
>
> Because of the shape configuration issue, we recommend that users define
> the tile shape at function entry and inline functions as much as possible.
> The AMX instructions focus on computation rather than storage, so global
> variables for tile data are not recommended.
>
>
>
> Thanks
>
> Yuanke
>


--
Thomas
